PROJET DE SOUTENANCE - DATAGONG x 10.000 Codeurs¶

Par Abdou-Raouf ATARMLA & Corneille HUEHA

0. PRESENTATION¶

Sujet : Prédiction du parti politique victorieux des élection présidentielles de 2020 aux USA à partir de données socio-démographiques

Plan de travail :

  • Setup (Installation et import des librairies nécessaires)
  • Import des données
  • Préparation & Constitution de la donnée exploitable
  • Analyse exploratoire
  • Modélisation
  • Evaluation
  • Conclusion

1. Setup¶

1.1. INSTALLATION DES DEPENDANCES¶

Nous devons installer les librairies suivantes :

  • pandas
  • numpy
  • scikit-learn
  • xlrd
  • openpyxl
  • matplotlib
  • seaborn
  • shap
  • xgboost

Pour cela, nous pouvons utiliser la commande suivante :

pip install -r requirements.txt
In [244]:
# Commande pour installer les dépendances
!pip install -r requirements.txt
Requirement already satisfied: pandas in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 1)) (2.2.3)
Requirement already satisfied: numpy==2.1.0 in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 2)) (2.1.0)
Requirement already satisfied: scikit-learn in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 3)) (1.6.1)
Requirement already satisfied: xlrd in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 4)) (2.0.1)
Requirement already satisfied: openpyxl in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 5)) (3.1.5)
Requirement already satisfied: matplotlib in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 6)) (3.10.0)
Requirement already satisfied: seaborn in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 7)) (0.13.2)
Requirement already satisfied: shap in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 8)) (0.46.0)
Requirement already satisfied: xgboost in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 9)) (2.1.4)
Requirement already satisfied: imbalanced-learn in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 10)) (0.13.0)
Requirement already satisfied: nbconvert in ./mvenv/lib/python3.10/site-packages (from -r requirements.txt (line 11)) (7.16.6)
Requirement already satisfied: python-dateutil>=2.8.2 in ./mvenv/lib/python3.10/site-packages (from pandas->-r requirements.txt (line 1)) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in ./mvenv/lib/python3.10/site-packages (from pandas->-r requirements.txt (line 1)) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in ./mvenv/lib/python3.10/site-packages (from pandas->-r requirements.txt (line 1)) (2025.1)
Requirement already satisfied: scipy>=1.6.0 in ./mvenv/lib/python3.10/site-packages (from scikit-learn->-r requirements.txt (line 3)) (1.15.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in ./mvenv/lib/python3.10/site-packages (from scikit-learn->-r requirements.txt (line 3)) (3.5.0)
Requirement already satisfied: joblib>=1.2.0 in ./mvenv/lib/python3.10/site-packages (from scikit-learn->-r requirements.txt (line 3)) (1.4.2)
Requirement already satisfied: et-xmlfile in ./mvenv/lib/python3.10/site-packages (from openpyxl->-r requirements.txt (line 5)) (2.0.0)
Requirement already satisfied: fonttools>=4.22.0 in ./mvenv/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 6)) (4.56.0)
Requirement already satisfied: pyparsing>=2.3.1 in ./mvenv/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 6)) (3.2.1)
Requirement already satisfied: cycler>=0.10 in ./mvenv/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 6)) (0.12.1)
Requirement already satisfied: contourpy>=1.0.1 in ./mvenv/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 6)) (1.3.1)
Requirement already satisfied: packaging>=20.0 in ./mvenv/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 6)) (24.2)
Requirement already satisfied: kiwisolver>=1.3.1 in ./mvenv/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 6)) (1.4.8)
Requirement already satisfied: pillow>=8 in ./mvenv/lib/python3.10/site-packages (from matplotlib->-r requirements.txt (line 6)) (11.1.0)
Requirement already satisfied: slicer==0.0.8 in ./mvenv/lib/python3.10/site-packages (from shap->-r requirements.txt (line 8)) (0.0.8)
Requirement already satisfied: numba in ./mvenv/lib/python3.10/site-packages (from shap->-r requirements.txt (line 8)) (0.61.0)
Requirement already satisfied: tqdm>=4.27.0 in ./mvenv/lib/python3.10/site-packages (from shap->-r requirements.txt (line 8)) (4.67.1)
Requirement already satisfied: cloudpickle in ./mvenv/lib/python3.10/site-packages (from shap->-r requirements.txt (line 8)) (3.1.1)
Requirement already satisfied: nvidia-nccl-cu12 in ./mvenv/lib/python3.10/site-packages (from xgboost->-r requirements.txt (line 9)) (2.25.1)
Requirement already satisfied: sklearn-compat<1,>=0.1 in ./mvenv/lib/python3.10/site-packages (from imbalanced-learn->-r requirements.txt (line 10)) (0.1.3)
Requirement already satisfied: jupyterlab-pygments in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (0.3.0)
Requirement already satisfied: defusedxml in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (0.7.1)
Requirement already satisfied: markupsafe>=2.0 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (3.0.2)
Requirement already satisfied: jinja2>=3.0 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (3.1.5)
Requirement already satisfied: pygments>=2.4.1 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (2.19.1)
Requirement already satisfied: traitlets>=5.1 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (5.14.3)
Requirement already satisfied: nbclient>=0.5.0 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (0.10.2)
Requirement already satisfied: nbformat>=5.7 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (5.10.4)
Requirement already satisfied: jupyter-core>=4.7 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (5.7.2)
Requirement already satisfied: bleach[css]!=5.0.0 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (6.2.0)
Requirement already satisfied: pandocfilters>=1.4.1 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (1.5.1)
Requirement already satisfied: beautifulsoup4 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (4.13.3)
Requirement already satisfied: mistune<4,>=2.0.3 in ./mvenv/lib/python3.10/site-packages (from nbconvert->-r requirements.txt (line 11)) (3.1.2)
Requirement already satisfied: webencodings in ./mvenv/lib/python3.10/site-packages (from bleach[css]!=5.0.0->nbconvert->-r requirements.txt (line 11)) (0.5.1)
Requirement already satisfied: tinycss2<1.5,>=1.1.0 in ./mvenv/lib/python3.10/site-packages (from bleach[css]!=5.0.0->nbconvert->-r requirements.txt (line 11)) (1.4.0)
Requirement already satisfied: platformdirs>=2.5 in ./mvenv/lib/python3.10/site-packages (from jupyter-core>=4.7->nbconvert->-r requirements.txt (line 11)) (4.3.6)
Requirement already satisfied: typing-extensions in ./mvenv/lib/python3.10/site-packages (from mistune<4,>=2.0.3->nbconvert->-r requirements.txt (line 11)) (4.12.2)
Requirement already satisfied: jupyter-client>=6.1.12 in ./mvenv/lib/python3.10/site-packages (from nbclient>=0.5.0->nbconvert->-r requirements.txt (line 11)) (8.6.3)
Requirement already satisfied: fastjsonschema>=2.15 in ./mvenv/lib/python3.10/site-packages (from nbformat>=5.7->nbconvert->-r requirements.txt (line 11)) (2.21.1)
Requirement already satisfied: jsonschema>=2.6 in ./mvenv/lib/python3.10/site-packages (from nbformat>=5.7->nbconvert->-r requirements.txt (line 11)) (4.23.0)
Requirement already satisfied: six>=1.5 in ./mvenv/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->-r requirements.txt (line 1)) (1.17.0)
Requirement already satisfied: soupsieve>1.2 in ./mvenv/lib/python3.10/site-packages (from beautifulsoup4->nbconvert->-r requirements.txt (line 11)) (2.6)
Requirement already satisfied: llvmlite<0.45,>=0.44.0dev0 in ./mvenv/lib/python3.10/site-packages (from numba->shap->-r requirements.txt (line 8)) (0.44.0)
Requirement already satisfied: rpds-py>=0.7.1 in ./mvenv/lib/python3.10/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert->-r requirements.txt (line 11)) (0.23.1)
Requirement already satisfied: attrs>=22.2.0 in ./mvenv/lib/python3.10/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert->-r requirements.txt (line 11)) (25.1.0)
Requirement already satisfied: referencing>=0.28.4 in ./mvenv/lib/python3.10/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert->-r requirements.txt (line 11)) (0.36.2)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in ./mvenv/lib/python3.10/site-packages (from jsonschema>=2.6->nbformat>=5.7->nbconvert->-r requirements.txt (line 11)) (2024.10.1)
Requirement already satisfied: pyzmq>=23.0 in ./mvenv/lib/python3.10/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert->-r requirements.txt (line 11)) (26.2.1)
Requirement already satisfied: tornado>=6.2 in ./mvenv/lib/python3.10/site-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert->-r requirements.txt (line 11)) (6.4.2)

1.2. IMPORTATION DES LIBRAIRIES¶

In [245]:
import warnings
warnings.filterwarnings("ignore")
In [246]:
# Importation des bibliothèques principales
  
import numpy as np  # Manipulation de tableaux numériques, gestion des calculs mathématiques  
import pandas as pd  # Gestion et manipulation des données sous forme de DataFrame  

# Visualisation des données  
import seaborn as sns  # Création de graphiques statistiques avancés  
import matplotlib.pyplot as plt  # Génération de visualisations basiques (histogrammes, scatter plots, etc.)  

# Préparation des données et séparation en train/test  
from sklearn.model_selection import train_test_split  # Découpage des données en ensembles d'entraînement et de test  
from sklearn.preprocessing import OneHotEncoder  # Encodage des variables catégorielles en variables numériques  

# Modèle de base (Régression Logistique)  
from sklearn.linear_model import LogisticRegression  # Modèle de régression logistique pour la classification  

# Évaluation des modèles  
from sklearn.metrics import classification_report  # Génération d’un rapport de performance des modèles  

# Gestion des classes déséquilibrées  
from sklearn.utils import resample  # Sous-échantillonnage de la classe majoritaire (undersampling)  
from imblearn.over_sampling import SMOTE  # Sur-échantillonnage de la classe minoritaire (oversampling)  

# Modèles avancés  
from sklearn.ensemble import RandomForestClassifier  # Modèle de Random Forest pour la classification  
from xgboost import XGBClassifier  # Modèle XGBoost pour une classification optimisée  

# Optimisation des modèles  
from sklearn.model_selection import GridSearchCV  # Recherche des meilleurs hyperparamètres avec validation croisée  
from imblearn.pipeline import Pipeline as ImbPipeline  # Création d'un pipeline intégrant le prétraitement et le modèle  

# Interprétabilité des modèles  
import shap  # Explication des prédictions des modèles avec SHAP (SHapley Additive exPlanations)  

# Profilage
from tqdm import tqdm  # Affichage de barres de progression pour suivre l’exécution des 

import matplotlib.gridspec as gridspec

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn.pipeline import Pipeline

2. IMPORTATION DES DONNÉES¶

In [247]:
# Résultats des élections présidentielles de 2020 au niveau des comtés
elections_2020 = pd.read_csv('./data/2020_US_County_Level_Presidential_Results.csv')

# Résultats des élections présidentielles de 2008 à 2016 au niveau des comtés
# Ce fichier ne sera utilisé que pour l’analyse exploratoire (comparaison des tendances)
elections_08_16 = pd.read_csv('./data/US_County_Level_Presidential_Results_08-16.csv')

# Données démographiques : Estimations de la population par comté
population = pd.read_excel('./data/PopulationEstimates.xls', engine='xlrd', header=2)

# Données sur l’éducation : niveaux d’études atteints par comté
education = pd.read_excel('./data/Education.xls', engine='xlrd', header=4)

# Données sur la pauvreté : taux de pauvreté par comté
poverty = pd.read_excel('./data/PovertyEstimates.xls', engine='xlrd', header=4)

# Données sur le chômage : taux de chômage par comté
unemployment = pd.read_excel('./data/Unemployment.xls', engine='xlrd', header=4)

3. PREPARATION ET CONSTITUTION DES DONNEES¶

3.1. Etape 1¶

In [248]:
# Harmonisation des colonnes pour assurer une cohérence entre les datasets

# Sélection et renommage des colonnes pour les résultats des élections 2020
df_2020 = elections_2020[['county_fips', 'county_name', 'state_name']].rename(
    columns={'county_fips': 'fips', 'county_name': 'county_name', 'state_name': 'state_name'}
)

# Sélection et renommage des colonnes pour les résultats des élections 2008-2016
df_08_16 = elections_08_16[['fips_code', 'county']].rename(
    columns={'fips_code': 'fips', 'county': 'county_name'}
)

# Sélection et renommage des colonnes pour les données de population
df_population = population[['FIPStxt', 'Area_Name', 'State']].rename(
    columns={'FIPStxt': 'fips', 'Area_Name': 'county_name', 'State': 'state_code'}
)

# Sélection et renommage des colonnes pour les données d'éducation
df_education = education[['FIPS Code', 'Area name', 'State']].rename(
    columns={'FIPS Code': 'fips', 'Area name': 'county_name', 'State': 'state_code'}
)

# Sélection et renommage des colonnes pour les données de pauvreté
df_poverty = poverty[['FIPStxt', 'Area_name', 'Stabr']].rename(
    columns={'FIPStxt': 'fips', 'Area_name': 'county_name', 'Stabr': 'state_code'}
)

# Sélection et renommage des colonnes pour les données de chômage
df_unemployment = unemployment[['fips_txt', 'area_name', 'Stabr']].rename(
    columns={'fips_txt': 'fips', 'area_name': 'county_name', 'Stabr': 'state_code'}
)


# Concaténation de tous les jeux de données en un seul DataFrame
checkpoint_0_raw = pd.concat([
    df_2020,
    df_08_16,
    df_population,
    df_education,
    df_poverty,
    df_unemployment
], ignore_index=True)
In [249]:
checkpoint_0_raw
Out[249]:
fips county_name state_name state_code
0 1001 Autauga County Alabama NaN
1 1003 Baldwin County Alabama NaN
2 1005 Barbour County Alabama NaN
3 1007 Bibb County Alabama NaN
4 1009 Blount County Alabama NaN
... ... ... ... ...
19283 72145 Vega Baja Municipio, PR NaN PR
19284 72147 Vieques Municipio, PR NaN PR
19285 72149 Villalba Municipio, PR NaN PR
19286 72151 Yabucoa Municipio, PR NaN PR
19287 72153 Yauco Municipio, PR NaN PR

19288 rows × 4 columns

In [250]:
# Harmonisation du code FIPS sur 5 caractères (ajout de zéros devant si nécessaire)
checkpoint_0_raw['fips'] = checkpoint_0_raw['fips'].astype(str).str.zfill(5)
checkpoint_0_raw
Out[250]:
fips county_name state_name state_code
0 01001 Autauga County Alabama NaN
1 01003 Baldwin County Alabama NaN
2 01005 Barbour County Alabama NaN
3 01007 Bibb County Alabama NaN
4 01009 Blount County Alabama NaN
... ... ... ... ...
19283 72145 Vega Baja Municipio, PR NaN PR
19284 72147 Vieques Municipio, PR NaN PR
19285 72149 Villalba Municipio, PR NaN PR
19286 72151 Yabucoa Municipio, PR NaN PR
19287 72153 Yauco Municipio, PR NaN PR

19288 rows × 4 columns

In [251]:
# Suppression des doublons basés sur le FIPS (identifiant unique des comtés)
checkpoint_0 = checkpoint_0_raw.drop_duplicates(subset=['fips']).sort_values(by='fips').reset_index(drop=True)
In [252]:
# Vérification des premières lignes du DataFrame final
checkpoint_0
Out[252]:
fips county_name state_name state_code
0 00000 United States NaN US
1 01000 Alabama NaN AL
2 01001 Autauga County Alabama NaN
3 01003 Baldwin County Alabama NaN
4 01005 Barbour County Alabama NaN
... ... ... ... ...
3319 72145 Vega Baja Municipio, Puerto Rico NaN PR
3320 72147 Vieques Municipio, Puerto Rico NaN PR
3321 72149 Villalba Municipio, Puerto Rico NaN PR
3322 72151 Yabucoa Municipio, Puerto Rico NaN PR
3323 72153 Yauco Municipio, Puerto Rico NaN PR

3324 rows × 4 columns

In [253]:
# Sauvegarde du DataFrame harmonisé dans un fichier Excel
checkpoint_0.to_excel('checkpoints/save_0.xlsx', index=False)
print("\n✅ Données harmonisées sauvegardées dans 'checkpoints/save_0.xlsx'.")
✅ Données harmonisées sauvegardées dans 'checkpoints/save_0.xlsx'.

3.2. Etape 2¶

In [254]:
# Chargement du fichier harmonisé précédent
ch0 = pd.read_excel('checkpoints/save_0.xlsx')  

# Liste des sources utilisées pour compléter les valeurs manquantes
sources = [df_2020, df_08_16, df_population, df_education, df_poverty, df_unemployment]

# Liste des colonnes à compléter si elles sont absentes
columns_to_fill = ['county_name', 'state_code', 'state_name']

# Vérification et création des colonnes manquantes si elles ne sont pas présentes dans `dm0`
for col in columns_to_fill:
    if col not in ch0.columns:
        ch0[col] = pd.NA  # Remplissage initial avec des valeurs manquantes

# Fonction pour compléter les valeurs manquantes en se basant sur les autres sources
def fill_missing_values(base_df, sources, columns_to_fill):
    """
    Remplit les valeurs manquantes d'un DataFrame en utilisant d'autres sources de données.
    
    - base_df : DataFrame principal contenant des valeurs manquantes
    - sources : Liste des DataFrames sources
    - columns_to_fill : Liste des colonnes à compléter
    """
    for source in sources:
        for col in columns_to_fill:
            if col in source.columns:  # Vérifie si la colonne existe dans la source
                base_df[col] = base_df[col].fillna(
                    base_df['fips'].map(source.set_index('fips')[col])  # Remplissage basé sur la correspondance FIPS
                )
    return base_df
In [255]:
# Application de la fonction pour compléter les valeurs manquantes
checkpoint_1_raw = fill_missing_values(ch0, sources, columns_to_fill)

# Vérification des valeurs manquantes après remplissage
ch1_missing_data = checkpoint_1_raw[columns_to_fill].isnull().sum()
print(f"Valeurs manquantes après remplissage :\n{ch1_missing_data}")
Valeurs manquantes après remplissage :
county_name      0
state_code      41
state_name     172
dtype: int64
In [256]:
# Sauvegarde du fichier complété
checkpoint_1_raw.to_excel('checkpoints/save_1.xlsx', index=False)
print("\n✅ Données complétées sauvegardées dans 'checkpoints/save_1.xlsx'.")
✅ Données complétées sauvegardées dans 'checkpoints/save_1.xlsx'.

3.3. Etape 3¶

In [257]:
df_2020 = elections_2020[['county_fips', 'county_name', 'state_name']].rename(
    columns={'county_fips': 'county_code', 'county_name': 'county_name', 'state_name': 'state_name'})

df_08_16 = elections_08_16[['fips_code', 'county']].rename(
    columns={'fips_code': 'county_code', 'county': 'county_name'})

df_population = population[['FIPStxt', 'Area_Name', 'State']].rename(
    columns={'FIPStxt': 'county_code', 'Area_Name': 'county_name', 'State': 'state_code'})

df_education = education[['FIPS Code', 'Area name', 'State']].rename(
    columns={'FIPS Code': 'county_code', 'Area name': 'county_name', 'State': 'state_code'})

df_poverty = poverty[['FIPStxt', 'Area_name', 'Stabr']].rename(
    columns={'FIPStxt': 'county_code', 'Area_name': 'county_name', 'Stabr': 'state_code'})

df_unemployment = unemployment[['fips_txt', 'area_name', 'Stabr']].rename(
    columns={'fips_txt': 'county_code', 'area_name': 'county_name', 'Stabr': 'state_code'})

# Liste des sources et leur priorité
sources = [df_2020, df_08_16, df_population, df_education, df_poverty, df_unemployment]

# Charger le fichier harmonisé existant ou créer une base vide
ch1 = pd.read_excel('checkpoints/save_1.xlsx')

# Renommer la colonne principale en `county_code`
ch1.rename(columns={'fips': 'county_code'}, inplace=True)

# S'assurer que `county_code` est formaté sur 5 chiffres
ch1['county_code'] = ch1['county_code'].astype(str).str.zfill(5)

# Ajouter les colonnes manquantes
columns_to_fill = ['county_name', 'state_code', 'state_name']
for col in columns_to_fill:
    if col not in ch1.columns:
        ch1[col] = pd.NA

# Fonction pour combler les valeurs manquantes
def fill_missing_values(base_df, sources, columns_to_fill):
    for source in sources:
        # S'assurer que county_code est au bon format dans les sources
        source['county_code'] = source['county_code'].astype(str).str.zfill(5)
        for col in columns_to_fill:
            if col in source.columns:  # Vérifier que la colonne existe dans la source
                base_df[col] = base_df[col].fillna(
                    base_df['county_code'].map(source.set_index('county_code')[col])  # Remplir selon la clé county_code
                )
    return base_df
In [258]:
# Compléter les colonnes manquantes
checkpoint_2_raw = fill_missing_values(ch1, sources, columns_to_fill)

# Vérification des valeurs manquantes
ch2_missing_data = checkpoint_2_raw[columns_to_fill].isnull().sum()
print(f"Valeurs manquantes après remplissage :\n{ch2_missing_data}")
Valeurs manquantes après remplissage :
county_name      0
state_code      41
state_name     172
dtype: int64
In [259]:
# Sauvegarder les données complètes
checkpoint_2_raw.to_excel('checkpoints/save_2.xlsx', index=False)
print("\nDonnées complétées sauvegardées dans 'checkpoints/save_2.xlsx'.")
Données complétées sauvegardées dans 'checkpoints/save_2.xlsx'.

3.4. Etape 4¶

In [260]:
elections_2020 = elections_2020[['county_fips', 'per_gop', 'per_dem']].rename(columns={'county_fips': 'county_code'})
elections_08_16 = elections_08_16[['fips_code', 'total_2016', 'dem_2016', 'gop_2016']].rename(columns={'fips_code': 'county_code'})
population = population[['FIPStxt', 'Rural-urban_Continuum Code_2013', 'Urban_Influence_Code_2013']].rename(
    columns={'FIPStxt': 'county_code', 'Rural-urban_Continuum Code_2013': 'rural_urban_code',
             'Urban_Influence_Code_2013': 'urban_influence_code'})
education = education[['FIPS Code', 'Percent of adults with less than a high school diploma, 2015-19',
                       'Percent of adults with a high school diploma only, 2015-19',
                       'Percent of adults completing some college or associate\'s degree, 2015-19',
                       'Percent of adults with a bachelor\'s degree or higher, 2015-19']].rename(
    columns={'FIPS Code': 'county_code',
             'Percent of adults with less than a high school diploma, 2015-19': 'percent_no_highschool',
             'Percent of adults with a high school diploma only, 2015-19': 'percent_highschool',
             'Percent of adults completing some college or associate\'s degree, 2015-19': 'percent_college',
             'Percent of adults with a bachelor\'s degree or higher, 2015-19': 'percent_bachelor'})
poverty = poverty[['FIPStxt', 'PCTPOVALL_2019', 'MEDHHINC_2019']].rename(
    columns={'FIPStxt': 'county_code', 'PCTPOVALL_2019': 'percent_poverty',
             'MEDHHINC_2019': 'median_household_income'})
unemployment = unemployment[['fips_txt', 'Unemployment_rate_2019', 'Employed_2019', 'Unemployed_2019']].rename(
    columns={'fips_txt': 'county_code', 'Unemployment_rate_2019': 'unemployment_rate'})

# Standardiser le format des `county_code` (5 caractères)
datasets = [elections_2020, elections_08_16, population, education, poverty, unemployment]
for df in datasets:
    df['county_code'] = df['county_code'].astype(str).str.zfill(5)

datamap = pd.read_excel('checkpoints/save_2.xlsx')
datamap['county_code'] = datamap['county_code'].astype(str).str.zfill(5)


# Liste des colonnes pertinentes à ajouter
columns_to_add = {
    'elections_2020': ['per_gop', 'per_dem'],
    'elections_08_16': ['total_2016', 'dem_2016', 'gop_2016'],
    'population': ['rural_urban_code', 'urban_influence_code'],
    'education': ['percent_no_highschool', 'percent_highschool', 'percent_college', 'percent_bachelor'],
    'poverty': ['percent_poverty', 'median_household_income'],
    'unemployment': ['unemployment_rate', 'Employed_2019', 'Unemployed_2019']
}

# Compléter les colonnes manquantes dans le fichier datamap
for source, cols in zip(datasets, columns_to_add.values()):
    for col in cols:
        if col not in datamap.columns:
            datamap[col] = pd.NA
        datamap[col] = datamap[col].fillna(datamap['county_code'].map(source.set_index('county_code')[col]))
print(datamap)
# Sauvegarder le fichier enrichi
datamap.to_excel('checkpoints/save_3.xlsx', index=False)
print("\nFichier 'checkpoints/save_3.xlsx' sauvegardé avec toutes les données intégrées.")
     county_code                       county_name state_name state_code  \
0          00000                     United States        NaN         US   
1          01000                           Alabama        NaN         AL   
2          01001                    Autauga County    Alabama         AL   
3          01003                    Baldwin County    Alabama         AL   
4          01005                    Barbour County    Alabama         AL   
...          ...                               ...        ...        ...   
3319       72145  Vega Baja Municipio, Puerto Rico        NaN         PR   
3320       72147    Vieques Municipio, Puerto Rico        NaN         PR   
3321       72149   Villalba Municipio, Puerto Rico        NaN         PR   
3322       72151    Yabucoa Municipio, Puerto Rico        NaN         PR   
3323       72153      Yauco Municipio, Puerto Rico        NaN         PR   

       per_gop   per_dem  total_2016  dem_2016  gop_2016  rural_urban_code  \
0          NaN       NaN         NaN       NaN       NaN               NaN   
1          NaN       NaN         NaN       NaN       NaN               NaN   
2     0.714368  0.270184     24661.0    5908.0   18110.0               2.0   
3     0.761714  0.224090     94090.0   18409.0   72780.0               3.0   
4     0.534512  0.457882     10390.0    4848.0    5431.0               6.0   
...        ...       ...         ...       ...       ...               ...   
3319       NaN       NaN         NaN       NaN       NaN               1.0   
3320       NaN       NaN         NaN       NaN       NaN               7.0   
3321       NaN       NaN         NaN       NaN       NaN               2.0   
3322       NaN       NaN         NaN       NaN       NaN               1.0   
3323       NaN       NaN         NaN       NaN       NaN               2.0   

      urban_influence_code  percent_no_highschool  percent_highschool  \
0                      NaN              11.998918           26.956844   
1                      NaN              13.819302           30.800268   
2                      2.0              11.483395           33.588459   
3                      2.0               9.193843           27.659616   
4                      6.0              26.786907           35.604542   
...                    ...                    ...                 ...   
3319                   1.0              28.428238           26.225822   
3320                  12.0              28.773281           39.177906   
3321                   2.0              21.993263           38.366028   
3322                   1.0              29.048897           25.715004   
3323                   2.0              26.556698           33.272095   

      percent_college  percent_bachelor  percent_poverty  \
0           28.898697         32.145542             12.3   
1           29.912098         25.468332             15.6   
2           28.356571         26.571573             12.1   
3           31.284081         31.862459             10.1   
4           26.029837         11.578713             27.1   
...               ...               ...              ...   
3319        24.123638         21.222300              NaN   
3320        14.049454         17.999357              NaN   
3321        19.727892         19.912819              NaN   
3322        27.233078         18.003019              NaN   
3323        15.529844         24.641363              NaN   

      median_household_income  unemployment_rate  Employed_2019  \
0                     65712.0           3.669409    157115247.0   
1                     51771.0           3.000000      2174483.0   
2                     58233.0           2.700000        25458.0   
3                     59871.0           2.700000        94675.0   
4                     35972.0           3.800000         8213.0   
...                       ...                ...            ...   
3319                      NaN           9.600000        11791.0   
3320                      NaN           6.900000         2406.0   
3321                      NaN          15.900000         6231.0   
3322                      NaN          13.100000         7552.0   
3323                      NaN          14.600000         8331.0   

      Unemployed_2019  
0           5984808.0  
1             67264.0  
2               714.0  
3              2653.0  
4               324.0  
...               ...  
3319           1246.0  
3320            179.0  
3321           1175.0  
3322           1139.0  
3323           1428.0  

[3324 rows x 20 columns]

Fichier 'checkpoints/save_3.xlsx' sauvegardé avec toutes les données intégrées.
In [261]:
# Vérifier les valeurs manquantes
missing_data = datamap.isnull().sum()
print("Valeurs manquantes par colonne :\n", missing_data)

# Pourcentage de valeurs manquantes
missing_percentage = (missing_data / len(datamap)) * 100
print("\nPourcentage de valeurs manquantes par colonne :\n", missing_percentage)

# Vérifier les doublons
duplicate_count = datamap.duplicated().sum()
print(f"Nombre de doublons dans le DataFrame : {duplicate_count}")

# Statistiques descriptives
stats = datamap.describe()
print("Statistiques descriptives :\n", stats)
Valeurs manquantes par colonne :
 county_code                  0
county_name                  0
state_name                 172
state_code                  41
per_gop                    172
per_dem                    172
total_2016                 212
dem_2016                   212
gop_2016                   212
rural_urban_code           104
urban_influence_code       104
percent_no_highschool       51
percent_highschool          51
percent_college             51
percent_bachelor            51
percent_poverty            131
median_household_income    131
unemployment_rate           52
Employed_2019               52
Unemployed_2019             52
dtype: int64

Pourcentage de valeurs manquantes par colonne :
 county_code                0.000000
county_name                0.000000
state_name                 5.174489
state_code                 1.233454
per_gop                    5.174489
per_dem                    5.174489
total_2016                 6.377858
dem_2016                   6.377858
gop_2016                   6.377858
rural_urban_code           3.128761
urban_influence_code       3.128761
percent_no_highschool      1.534296
percent_highschool         1.534296
percent_college            1.534296
percent_bachelor           1.534296
percent_poverty            3.941035
median_household_income    3.941035
unemployment_rate          1.564380
Employed_2019              1.564380
Unemployed_2019            1.564380
dtype: float64
Nombre de doublons dans le DataFrame : 0
Statistiques descriptives :
            per_gop      per_dem    total_2016      dem_2016       gop_2016  \
count  3152.000000  3152.000000  3.112000e+03  3.112000e+03    3112.000000   
mean      0.647805     0.333851  4.089631e+04  1.956104e+04   19343.762211   
std       0.162014     0.159852  1.082522e+05  6.847899e+04   39125.598644   
min       0.053973     0.030909  6.400000e+01  4.000000e+00      57.000000   
25%       0.554128     0.209978  4.815000e+03  1.164750e+03    3206.000000   
50%       0.681720     0.300235  1.092950e+04  3.140000e+03    7113.000000   
75%       0.773776     0.425830  2.866450e+04  9.535250e+03   17391.750000   
max       0.961818     0.921497  2.314275e+06  1.654626e+06  590465.000000   

       rural_urban_code  urban_influence_code  percent_no_highschool  \
count       3220.000000           3220.000000            3273.000000   
mean           4.937888              5.188820              13.330532   
std            2.724344              3.506848               6.545762   
min            1.000000              1.000000               1.116910   
25%            2.000000              2.000000               8.540109   
50%            6.000000              5.000000              11.884497   
75%            7.000000              8.000000              17.020765   
max            9.000000             12.000000              73.560211   

       percent_highschool  percent_college  percent_bachelor  percent_poverty  \
count         3273.000000      3273.000000       3273.000000      3193.000000   
mean            33.956041        30.587866         22.125561        14.417946   
std              7.212828         5.340745          9.536379         5.769337   
min              7.265136         5.235602          0.000000         2.700000   
25%             29.369493        27.022669         15.511985        10.400000   
50%             34.249691        30.628832         19.776859        13.400000   
75%             38.947308        34.079266         26.348045        17.400000   
max             57.433674        60.563381         77.557411        47.700000   

       median_household_income  unemployment_rate  Employed_2019  \
count              3193.000000        3272.000000   3.272000e+03   
mean              55874.761979           4.139630   1.446578e+05   
std               14493.345229           1.785734   2.808116e+06   
min               24732.000000           0.700000   2.120000e+02   
25%               46309.000000           3.000000   4.778750e+03   
50%               53505.000000           3.700000   1.129300e+04   
75%               62327.000000           4.700000   3.218725e+04   
max              151806.000000          19.300000   1.571152e+08   

       Unemployed_2019  
count     3.272000e+03  
mean      5.541794e+03  
std       1.071326e+05  
min       4.000000e+00  
25%       2.030000e+02  
50%       4.990000e+02  
75%       1.328500e+03  
max       5.984808e+06  

REMARQUE¶

Ici on remarque que les données qui manques pour per_gop, per_dem, total_2016, dem_2016, gop_2016, rural_urban_code, urban_influence_code les zone regionales.

par exemple Alasca (02000) a ses county 02*** Alabama (01000) a ses county 01***

In [262]:
# Étape initiale : Identifier les états
datamap['is_state'] = datamap['county_code'].apply(lambda x: 1 if x.endswith('000') else 0)
datamap_complete = datamap.copy()

# Supprimer la ligne correspondant à county_code = "00000" (United States)
datamap_complete = datamap_complete[datamap_complete['county_code'] != '00000']
print(f"Ligne avec county_code '00000' (United States) supprimée. Nombre de lignes restantes : {len(datamap_complete)}")

# Mettre à jour state_name pour les états
datamap_complete.loc[datamap_complete['is_state'] == 1, 'state_name'] = datamap_complete['county_name']

# Afficher le dataframe initial
print(datamap_complete)

# Sauvegarder le fichier enrichi
datamap_complete.to_excel('checkpoints/save_4.xlsx', index=False)
print("\nFichier 'checkpoints/save_4.xlsx' sauvegardé avec toutes les données intégrées.")

# Ajouter state_prefix
datamap_complete['state_prefix'] = datamap_complete['county_code'].str[:2]

# Agrégation existante pour per_gop, per_dem, etc.
state_agg = datamap_complete[datamap_complete['is_state'] == 0].groupby('state_prefix').agg({
    'per_gop': 'mean', 
    'per_dem': 'mean', 
    'gop_2016': 'sum', 
    'dem_2016': 'sum', 
    'total_2016': 'sum'
}).reset_index()

# Mettre à jour les lignes des états avec les agrégations
for index, row in state_agg.iterrows():
    state_prefix = row['state_prefix']
    state_county_code = state_prefix + '000'
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'per_gop'] = row['per_gop']
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'per_dem'] = row['per_dem']
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'total_2016'] = row['total_2016']
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'gop_2016'] = row['gop_2016']
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'dem_2016'] = row['dem_2016']

# Nouvelle étape : Calculer la distribution des rural_urban_code par état
ruc_dist = datamap_complete[datamap_complete['is_state'] == 0].groupby('state_prefix')['rural_urban_code'].value_counts(normalize=True).unstack(fill_value=0)
ruc_dist.columns = [f'ruc_{int(col)}' for col in ruc_dist.columns]  # Renommer colonnes : ruc_1, ruc_2, etc.

# Si urban_influence_code est présent (préparation pour le futur)
if 'urban_influence_code' in datamap_complete.columns:
    uic_dist = datamap_complete[datamap_complete['is_state'] == 0].groupby('state_prefix')['urban_influence_code'].value_counts(normalize=True).unstack(fill_value=0)
    uic_dist.columns = [f'uic_{int(col)}' for col in uic_dist.columns]  # Renommer : uic_1, uic_2, etc.
else:
    print("Note : 'urban_influence_code' n'est pas dans le dataset. Seuls les RUCC seront calculés.")

# Fusionner avec les lignes des états
df_states = datamap_complete[datamap_complete['is_state'] == 1].set_index('state_prefix')
df_states = df_states.join(ruc_dist, how='left')
if 'urban_influence_code' in datamap_complete.columns:
    df_states = df_states.join(uic_dist, how='left')

# Réintégrer dans le dataframe complet
datamap_complete = pd.concat([datamap_complete[datamap_complete['is_state'] == 0], df_states.reset_index()], ignore_index=True)

# Remplir les NaN avec 0 pour les nouvelles colonnes
for col in ruc_dist.columns:
    datamap_complete[col] = datamap_complete[col].fillna(0)
if 'urban_influence_code' in datamap_complete.columns:
    for col in uic_dist.columns:
        datamap_complete[col] = datamap_complete[col].fillna(0)

# Afficher les colonnes pertinentes, y compris les nouvelles distributions RUCC
print(datamap_complete[['county_code', 'county_name', 'state_name', 'state_code', 'per_gop', 'per_dem', 'total_2016', 'gop_2016', 'dem_2016', 'is_state'] + [col for col in datamap_complete.columns if col.startswith('ruc_')]])

# Sauvegarder le fichier enrichi
datamap_complete.to_excel('checkpoints/save_5.xlsx', index=False)
print("\nFichier 'checkpoints/save_5.xlsx' sauvegardé avec toutes les données intégrées, y compris les distributions RUCC.")
Ligne avec county_code '00000' (United States) supprimée. Nombre de lignes restantes : 3323
     county_code                       county_name state_name state_code  \
1          01000                           Alabama    Alabama         AL   
2          01001                    Autauga County    Alabama         AL   
3          01003                    Baldwin County    Alabama         AL   
4          01005                    Barbour County    Alabama         AL   
5          01007                       Bibb County    Alabama         AL   
...          ...                               ...        ...        ...   
3319       72145  Vega Baja Municipio, Puerto Rico        NaN         PR   
3320       72147    Vieques Municipio, Puerto Rico        NaN         PR   
3321       72149   Villalba Municipio, Puerto Rico        NaN         PR   
3322       72151    Yabucoa Municipio, Puerto Rico        NaN         PR   
3323       72153      Yauco Municipio, Puerto Rico        NaN         PR   

       per_gop   per_dem  total_2016  dem_2016  gop_2016  rural_urban_code  \
1          NaN       NaN         NaN       NaN       NaN               NaN   
2     0.714368  0.270184     24661.0    5908.0   18110.0               2.0   
3     0.761714  0.224090     94090.0   18409.0   72780.0               3.0   
4     0.534512  0.457882     10390.0    4848.0    5431.0               6.0   
5     0.784263  0.206983      8748.0    1874.0    6733.0               1.0   
...        ...       ...         ...       ...       ...               ...   
3319       NaN       NaN         NaN       NaN       NaN               1.0   
3320       NaN       NaN         NaN       NaN       NaN               7.0   
3321       NaN       NaN         NaN       NaN       NaN               2.0   
3322       NaN       NaN         NaN       NaN       NaN               1.0   
3323       NaN       NaN         NaN       NaN       NaN               2.0   

      ...  percent_no_highschool  percent_highschool  percent_college  \
1     ...              13.819302           30.800268        29.912098   
2     ...              11.483395           33.588459        28.356571   
3     ...               9.193843           27.659616        31.284081   
4     ...              26.786907           35.604542        26.029837   
5     ...              20.942602           44.878773        23.800098   
...   ...                    ...                 ...              ...   
3319  ...              28.428238           26.225822        24.123638   
3320  ...              28.773281           39.177906        14.049454   
3321  ...              21.993263           38.366028        19.727892   
3322  ...              29.048897           25.715004        27.233078   
3323  ...              26.556698           33.272095        15.529844   

      percent_bachelor  percent_poverty  median_household_income  \
1            25.468332             15.6                  51771.0   
2            26.571573             12.1                  58233.0   
3            31.862459             10.1                  59871.0   
4            11.578713             27.1                  35972.0   
5            10.378526             20.3                  47918.0   
...                ...              ...                      ...   
3319         21.222300              NaN                      NaN   
3320         17.999357              NaN                      NaN   
3321         19.912819              NaN                      NaN   
3322         18.003019              NaN                      NaN   
3323         24.641363              NaN                      NaN   

      unemployment_rate  Employed_2019  Unemployed_2019  is_state  
1                   3.0      2174483.0          67264.0         1  
2                   2.7        25458.0            714.0         0  
3                   2.7        94675.0           2653.0         0  
4                   3.8         8213.0            324.0         0  
5                   3.1         8419.0            266.0         0  
...                 ...            ...              ...       ...  
3319                9.6        11791.0           1246.0         0  
3320                6.9         2406.0            179.0         0  
3321               15.9         6231.0           1175.0         0  
3322               13.1         7552.0           1139.0         0  
3323               14.6         8331.0           1428.0         0  

[3323 rows x 21 columns]

Fichier 'checkpoints/save_4.xlsx' sauvegardé avec toutes les données intégrées.
     county_code     county_name     state_name state_code   per_gop  \
0          01001  Autauga County        Alabama         AL  0.714368   
1          01003  Baldwin County        Alabama         AL  0.761714   
2          01005  Barbour County        Alabama         AL  0.534512   
3          01007     Bibb County        Alabama         AL  0.784263   
4          01009   Blount County        Alabama         AL  0.895716   
...          ...             ...            ...        ...       ...   
3318       53000      Washington     Washington         WA  0.520402   
3319       54000   West Virginia  West Virginia         WV  0.741402   
3320       55000       Wisconsin      Wisconsin         WI  0.564259   
3321       56000         Wyoming        Wyoming         WY  0.750912   
3322       72000     Puerto Rico    Puerto Rico         PR       NaN   

       per_dem  total_2016   gop_2016   dem_2016  is_state     ruc_1  \
0     0.270184     24661.0    18110.0     5908.0         0  0.000000   
1     0.224090     94090.0    72780.0    18409.0         0  0.000000   
2     0.457882     10390.0     5431.0     4848.0         0  0.000000   
3     0.206983      8748.0     6733.0     1874.0         0  0.000000   
4     0.095694     25384.0    22808.0     2150.0         0  0.000000   
...        ...         ...        ...        ...       ...       ...   
3318  0.448434   2765627.0  1043648.0  1523720.0         1  0.128205   
3319  0.243346    708226.0   486198.0   187457.0         1  0.018182   
3320  0.419635   2944620.0  1409467.0  1382210.0         1  0.097222   
3321  0.217684    248742.0   174248.0    55949.0         1  0.000000   
3322       NaN         0.0        0.0        0.0         1  0.512821   

         ruc_2     ruc_3     ruc_4     ruc_5     ruc_6     ruc_7     ruc_8  \
0     0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
1     0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
2     0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
3     0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
4     0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
...        ...       ...       ...       ...       ...       ...       ...   
3318  0.179487  0.230769  0.153846  0.051282  0.102564  0.025641  0.076923   
3319  0.090909  0.272727  0.036364  0.018182  0.236364  0.127273  0.127273   
3320  0.111111  0.152778  0.097222  0.000000  0.277778  0.083333  0.111111   
3321  0.000000  0.086957  0.043478  0.086957  0.043478  0.565217  0.000000   
3322  0.205128  0.166667  0.038462  0.000000  0.051282  0.012821  0.000000   

         ruc_9  
0     0.000000  
1     0.000000  
2     0.000000  
3     0.000000  
4     0.000000  
...        ...  
3318  0.051282  
3319  0.072727  
3320  0.069444  
3321  0.173913  
3322  0.012821  

[3323 rows x 19 columns]

Fichier 'checkpoints/save_5.xlsx' sauvegardé avec toutes les données intégrées, y compris les distributions RUCC.
In [263]:
# Filtrer uniquement les États (is_state == 1)
datamap_states = datamap_complete[datamap_complete['is_state'] == 1].copy()

# Vérification
print(f"Nombre total d'observations : {len(datamap)}")
print(f"Nombre d'États : {len(datamap_states)}")
print(datamap_states.head())

# Sauvegarder le fichier enrichi
datamap_states.to_excel('checkpoints/states_1.xlsx', index=False)
print("\nFichier 'checkpoints/states_1.xlsx' sauvegardé avec toutes les données intégrées.")
Nombre total d'observations : 3324
Nombre d'États : 52
     county_code county_name  state_name state_code   per_gop   per_dem  \
3271       01000     Alabama     Alabama         AL  0.647359  0.342648   
3272       02000      Alaska      Alaska         AK  0.497797  0.420912   
3273       04000     Arizona     Arizona         AZ  0.548723  0.435861   
3274       05000    Arkansas    Arkansas         AR  0.688531  0.282032   
3275       06000  California  California         CA  0.439389  0.537068   

      total_2016   dem_2016   gop_2016  rural_urban_code  ...     uic_3  \
3271   2078165.0   718084.0  1306925.0               NaN  ...  0.044776   
3272         0.0        0.0        0.0               NaN  ...  0.000000   
3273   2062810.0   936250.0  1021154.0               NaN  ...  0.066667   
3274   1108615.0   378729.0   677904.0               NaN  ...  0.040000   
3275   9631972.0  5931283.0  3184721.0               NaN  ...  0.017241   

         uic_4     uic_5     uic_6     uic_7     uic_8     uic_9    uic_10  \
3271  0.059701  0.104478  0.208955  0.029851  0.000000  0.000000  0.044776   
3272  0.000000  0.000000  0.000000  0.034483  0.068966  0.000000  0.068966   
3273  0.066667  0.133333  0.066667  0.000000  0.066667  0.066667  0.000000   
3274  0.013333  0.053333  0.213333  0.040000  0.133333  0.146667  0.066667   
3275  0.051724  0.068966  0.086207  0.034483  0.051724  0.000000  0.000000   

        uic_11    uic_12  
3271  0.074627  0.000000  
3272  0.344828  0.379310  
3273  0.000000  0.000000  
3274  0.026667  0.000000  
3275  0.034483  0.017241  

[5 rows x 43 columns]

Fichier 'checkpoints/states_1.xlsx' sauvegardé avec toutes les données intégrées.
In [264]:
to_drop_state_cols = ['rural_urban_code', 'urban_influence_code','is_state','state_prefix','county_name']
cleaned_state_df = datamap_states.drop(to_drop_state_cols, axis=1)
cleaned_state_df.rename(columns={'county_code': 'id'}, inplace=True)
cleaned_state_df.head()
Out[264]:
id state_name state_code per_gop per_dem total_2016 dem_2016 gop_2016 percent_no_highschool percent_highschool ... uic_3 uic_4 uic_5 uic_6 uic_7 uic_8 uic_9 uic_10 uic_11 uic_12
3271 01000 Alabama AL 0.647359 0.342648 2078165.0 718084.0 1306925.0 13.819302 30.800268 ... 0.044776 0.059701 0.104478 0.208955 0.029851 0.000000 0.000000 0.044776 0.074627 0.000000
3272 02000 Alaska AK 0.497797 0.420912 0.0 0.0 0.0 7.152934 28.003729 ... 0.000000 0.000000 0.000000 0.000000 0.034483 0.068966 0.000000 0.068966 0.344828 0.379310
3273 04000 Arizona AZ 0.548723 0.435861 2062810.0 936250.0 1021154.0 12.860705 23.858877 ... 0.066667 0.066667 0.133333 0.066667 0.000000 0.066667 0.066667 0.000000 0.000000 0.000000
3274 05000 Arkansas AR 0.688531 0.282032 1108615.0 378729.0 677904.0 13.430243 34.034885 ... 0.040000 0.013333 0.053333 0.213333 0.040000 0.133333 0.146667 0.066667 0.026667 0.000000
3275 06000 California CA 0.439389 0.537068 9631972.0 5931283.0 3184721.0 16.692171 20.487896 ... 0.017241 0.051724 0.068966 0.086207 0.034483 0.051724 0.000000 0.000000 0.034483 0.017241

5 rows × 38 columns

In [265]:
# Retirons l'état à l'id 72000 qui correspond au 'Puerto Rico' vu qu'il ne dispose pas de details permettant de determiner la variable cible

cleaned_state_df = cleaned_state_df[cleaned_state_df['id'] != '72000']
cleaned_state_df
Out[265]:
id state_name state_code per_gop per_dem total_2016 dem_2016 gop_2016 percent_no_highschool percent_highschool ... uic_3 uic_4 uic_5 uic_6 uic_7 uic_8 uic_9 uic_10 uic_11 uic_12
3271 01000 Alabama AL 0.647359 0.342648 2078165.0 718084.0 1306925.0 13.819302 30.800268 ... 0.044776 0.059701 0.104478 0.208955 0.029851 0.000000 0.000000 0.044776 0.074627 0.000000
3272 02000 Alaska AK 0.497797 0.420912 0.0 0.0 0.0 7.152934 28.003729 ... 0.000000 0.000000 0.000000 0.000000 0.034483 0.068966 0.000000 0.068966 0.344828 0.379310
3273 04000 Arizona AZ 0.548723 0.435861 2062810.0 936250.0 1021154.0 12.860705 23.858877 ... 0.066667 0.066667 0.133333 0.066667 0.000000 0.066667 0.066667 0.000000 0.000000 0.000000
3274 05000 Arkansas AR 0.688531 0.282032 1108615.0 378729.0 677904.0 13.430243 34.034885 ... 0.040000 0.013333 0.053333 0.213333 0.040000 0.133333 0.146667 0.066667 0.026667 0.000000
3275 06000 California CA 0.439389 0.537068 9631972.0 5931283.0 3184721.0 16.692171 20.487896 ... 0.017241 0.051724 0.068966 0.086207 0.034483 0.051724 0.000000 0.000000 0.034483 0.017241
3276 08000 Colorado CO 0.559502 0.417248 2564185.0 1212209.0 1137455.0 8.253678 21.368059 ... 0.015625 0.015625 0.046875 0.062500 0.046875 0.109375 0.000000 0.140625 0.125000 0.171875
3277 09000 Connecticut CT 0.424576 0.557866 1623542.0 884432.0 668266.0 9.369879 26.854712 ... 0.125000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3278 10000 Delaware DE 0.443037 0.542737 441535.0 235581.0 185103.0 9.982669 31.292805 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3279 11000 District of Columbia DC 0.053973 0.921497 280272.0 260223.0 11553.0 9.076816 16.835115 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3280 12000 Florida FL 0.633620 0.357409 9386750.0 4485745.0 4605515.0 11.810859 28.573500 ... 0.089552 0.044776 0.014925 0.149254 0.029851 0.000000 0.000000 0.000000 0.000000 0.014925
3281 13000 Georgia GA 0.639809 0.350515 4029564.0 1837300.0 2068623.0 12.855142 27.714716 ... 0.037736 0.044025 0.075472 0.157233 0.050314 0.062893 0.056604 0.012579 0.012579 0.025157
3282 15000 Hawaii HI 0.330023 0.648358 428825.0 266827.0 128815.0 8.028257 27.356155 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.400000 0.000000 0.000000 0.000000 0.000000
3283 16000 Idaho ID 0.730509 0.240739 688235.0 189677.0 407199.0 9.226700 27.356852 ... 0.000000 0.000000 0.159091 0.136364 0.113636 0.181818 0.045455 0.000000 0.045455 0.045455
3284 17000 Illinois IL 0.652193 0.327196 5374280.0 2977498.0 2118179.0 10.787586 25.954943 ... 0.049020 0.058824 0.078431 0.147059 0.029412 0.107843 0.088235 0.009804 0.019608 0.019608
3285 18000 Indiana IN 0.688717 0.291546 2722029.0 1031953.0 1556220.0 11.181375 33.406757 ... 0.076087 0.108696 0.141304 0.097826 0.021739 0.054348 0.021739 0.000000 0.000000 0.000000
3286 19000 Iowa IA 0.638318 0.344197 1542880.0 650790.0 798923.0 7.908792 30.982542 ... 0.000000 0.000000 0.050505 0.262626 0.070707 0.121212 0.141414 0.050505 0.050505 0.040404
3287 20000 Kansas KS 0.752552 0.227409 1147143.0 414788.0 656009.0 9.048409 25.906137 ... 0.019048 0.019048 0.047619 0.085714 0.028571 0.104762 0.095238 0.142857 0.066667 0.209524
3288 21000 Kentucky KY 0.740423 0.245121 1922218.0 628834.0 1202942.0 13.738696 32.893917 ... 0.033333 0.041667 0.033333 0.091667 0.066667 0.150000 0.083333 0.091667 0.041667 0.075000
3289 22000 Louisiana LA 0.646492 0.339012 2027731.0 779535.0 1178004.0 14.773869 33.962753 ... 0.015625 0.015625 0.093750 0.171875 0.031250 0.031250 0.015625 0.046875 0.031250 0.000000
3290 23000 Maine ME 0.486435 0.485294 741550.0 354873.0 334838.0 7.389770 31.473530 ... 0.000000 0.000000 0.062500 0.375000 0.062500 0.000000 0.000000 0.000000 0.187500 0.000000
3291 24000 Maryland MD 0.477075 0.498362 2474543.0 1497951.0 873646.0 9.796140 24.611042 ... 0.041667 0.083333 0.041667 0.000000 0.041667 0.000000 0.000000 0.000000 0.000000 0.000000
3292 25000 Massachusetts MA 0.308799 0.668589 3231531.0 1964768.0 1083069.0 9.242436 24.019262 ... 0.000000 0.000000 0.071429 0.000000 0.000000 0.071429 0.000000 0.000000 0.071429 0.000000
3293 26000 Michigan MI 0.596681 0.387700 4789450.0 2267373.0 2279210.0 9.190457 28.873878 ... 0.012048 0.024096 0.108434 0.036145 0.024096 0.180723 0.048193 0.120482 0.108434 0.024096
3294 27000 Minnesota MN 0.602136 0.376106 2916404.0 1366676.0 1322891.0 6.859513 24.647589 ... 0.045977 0.080460 0.080460 0.126437 0.057471 0.068966 0.091954 0.034483 0.034483 0.068966
3295 28000 Mississippi MS 0.562834 0.423827 1162987.0 462001.0 678457.0 15.493731 30.438028 ... 0.024390 0.048780 0.073171 0.109756 0.109756 0.219512 0.146341 0.060976 0.000000 0.000000
3296 29000 Missouri MO 0.752084 0.232851 2775098.0 1054889.0 1585753.0 10.078580 30.617037 ... 0.034783 0.095652 0.060870 0.113043 0.052174 0.095652 0.060870 0.139130 0.034783 0.017391
3297 30000 Montana MT 0.689422 0.287197 483574.0 174521.0 274120.0 6.449651 28.832081 ... 0.000000 0.000000 0.000000 0.071429 0.160714 0.089286 0.089286 0.053571 0.160714 0.285714
3298 31000 Nebraska NE 0.780742 0.198169 805638.0 273858.0 485819.0 8.595745 26.106092 ... 0.000000 0.000000 0.043011 0.053763 0.096774 0.139785 0.064516 0.172043 0.086022 0.204301
3299 32000 Nevada NV 0.696978 0.277140 1122990.0 537753.0 511319.0 13.309174 28.085070 ... 0.117647 0.058824 0.117647 0.000000 0.000000 0.176471 0.058824 0.117647 0.117647 0.000000
3300 33000 New Hampshire NH 0.458564 0.523999 730628.0 348126.0 345598.0 6.894038 27.419645 ... 0.200000 0.100000 0.100000 0.000000 0.000000 0.300000 0.000000 0.000000 0.000000 0.000000
3301 34000 New Jersey NJ 0.437915 0.544962 3674893.0 2021756.0 1535513.0 10.183736 27.185795 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3302 35000 New Mexico NM 0.532221 0.447902 783127.0 380724.0 315875.0 14.411727 26.460430 ... 0.000000 0.000000 0.181818 0.060606 0.000000 0.242424 0.090909 0.060606 0.090909 0.060606
3303 36000 New York NY 0.508439 0.471869 7046175.0 4143874.0 2640570.0 13.179301 25.977776 ... 0.096774 0.032258 0.080645 0.080645 0.032258 0.048387 0.016129 0.000000 0.000000 0.000000
3304 37000 North Carolina NC 0.584579 0.403189 4629471.0 2162074.0 2339603.0 12.219548 25.652466 ... 0.110000 0.050000 0.130000 0.050000 0.060000 0.040000 0.020000 0.040000 0.010000 0.030000
3305 38000 North Dakota ND 0.725569 0.247834 336968.0 93526.0 216133.0 7.351314 26.429096 ... 0.000000 0.000000 0.018868 0.037736 0.150943 0.113208 0.018868 0.245283 0.037736 0.264151
3306 39000 Ohio OH 0.674596 0.310536 5325395.0 2317001.0 2771984.0 9.621357 33.037495 ... 0.193182 0.056818 0.159091 0.079545 0.011364 0.022727 0.034091 0.011364 0.000000 0.000000
3307 40000 Oklahoma OK 0.778398 0.202755 1451056.0 419788.0 947934.0 11.976947 31.330032 ... 0.038961 0.077922 0.064935 0.129870 0.012987 0.129870 0.155844 0.129870 0.012987 0.012987
3308 41000 Oregon OR 0.566156 0.404151 1808575.0 934631.0 742506.0 9.287846 22.735300 ... 0.083333 0.027778 0.138889 0.027778 0.000000 0.138889 0.000000 0.083333 0.055556 0.083333
3309 42000 Pennsylvania PA 0.635927 0.350978 5970107.0 2844705.0 2912941.0 9.480545 34.693886 ... 0.059701 0.044776 0.149254 0.029851 0.059701 0.029851 0.059701 0.014925 0.000000 0.000000
3310 44000 Rhode Island RI 0.380590 0.598526 450121.0 249902.0 179421.0 11.186517 28.267363 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3311 45000 South Carolina SC 0.535671 0.452410 2084444.0 849469.0 1143611.0 12.488965 29.103146 ... 0.000000 0.021739 0.173913 0.217391 0.000000 0.000000 0.021739 0.000000 0.000000 0.000000
3312 46000 South Dakota SD 0.673640 0.305517 370047.0 117442.0 227701.0 8.253245 30.237791 ... 0.000000 0.000000 0.045455 0.060606 0.106061 0.151515 0.015152 0.303030 0.030303 0.166667
3313 47000 Tennessee TN 0.747807 0.237236 2484691.0 867110.0 1517402.0 12.537143 32.088009 ... 0.073684 0.094737 0.073684 0.115789 0.052632 0.063158 0.031579 0.042105 0.000000 0.010526
3314 48000 Texas TX 0.743895 0.245202 8903237.0 3867816.0 4681590.0 16.313875 24.957039 ... 0.047244 0.078740 0.074803 0.133858 0.051181 0.059055 0.070866 0.047244 0.051181 0.062992
3315 49000 Utah UT 0.728772 0.241405 852461.0 237241.0 397004.0 7.719078 22.836246 ... 0.068966 0.000000 0.034483 0.068966 0.034483 0.068966 0.103448 0.068966 0.103448 0.103448
3316 50000 Vermont VT 0.351706 0.615269 291413.0 178179.0 95053.0 7.327794 28.795351 ... 0.000000 0.000000 0.214286 0.071429 0.071429 0.214286 0.142857 0.000000 0.071429 0.000000
3317 51000 Virginia VA 0.551998 0.431626 3844787.0 1916845.0 1731156.0 10.305691 23.953545 ... 0.000000 0.150376 0.030075 0.060150 0.045113 0.030075 0.030075 0.007519 0.007519 0.037594
3318 53000 Washington WA 0.520402 0.448434 2765627.0 1523720.0 1043648.0 8.672709 21.999466 ... 0.102564 0.051282 0.076923 0.051282 0.051282 0.051282 0.025641 0.000000 0.000000 0.051282
3319 54000 West Virginia WV 0.741402 0.243346 708226.0 187457.0 486198.0 13.097753 40.320034 ... 0.000000 0.000000 0.090909 0.127273 0.181818 0.054545 0.072727 0.054545 0.000000 0.036364
3320 55000 Wisconsin WI 0.564259 0.419635 2944620.0 1382210.0 1409467.0 7.791683 30.638811 ... 0.055556 0.041667 0.125000 0.236111 0.027778 0.013889 0.041667 0.000000 0.027778 0.069444
3321 56000 Wyoming WY 0.750912 0.217684 248742.0 55949.0 174248.0 6.834035 29.073072 ... 0.000000 0.000000 0.043478 0.043478 0.000000 0.260870 0.260870 0.086957 0.130435 0.086957

51 rows × 38 columns

In [266]:
# Sauvegarder le fichier enrichi
cleaned_state_df.to_excel('checkpoints/states_df.xlsx', index=False)
print("\nFichier 'checkpoints/states_df.xlsx' sauvegardé avec toutes les données intégrées.")
Fichier 'checkpoints/states_df.xlsx' sauvegardé avec toutes les données intégrées.

Parfait notre dataframe est clean et prêt à être utilisé pour le modèle.


4. ANALYSE EXPLORATOIRE¶

4.1. Recuperation , Inspection et Creation de la variable cible (Target)¶

In [267]:
df = pd.read_excel("checkpoints/states_df.xlsx")
In [268]:
df["target"] = (df["per_gop"] > df["per_dem"]).astype(int)
df = df.drop(columns=["is_state"], errors="ignore")
In [269]:
df.head()
Out[269]:
id state_name state_code per_gop per_dem total_2016 dem_2016 gop_2016 percent_no_highschool percent_highschool ... uic_4 uic_5 uic_6 uic_7 uic_8 uic_9 uic_10 uic_11 uic_12 target
0 1000 Alabama AL 0.647359 0.342648 2078165 718084 1306925 13.819302 30.800268 ... 0.059701 0.104478 0.208955 0.029851 0.000000 0.000000 0.044776 0.074627 0.000000 1
1 2000 Alaska AK 0.497797 0.420912 0 0 0 7.152934 28.003729 ... 0.000000 0.000000 0.000000 0.034483 0.068966 0.000000 0.068966 0.344828 0.379310 1
2 4000 Arizona AZ 0.548723 0.435861 2062810 936250 1021154 12.860705 23.858877 ... 0.066667 0.133333 0.066667 0.000000 0.066667 0.066667 0.000000 0.000000 0.000000 1
3 5000 Arkansas AR 0.688531 0.282032 1108615 378729 677904 13.430243 34.034885 ... 0.013333 0.053333 0.213333 0.040000 0.133333 0.146667 0.066667 0.026667 0.000000 1
4 6000 California CA 0.439389 0.537068 9631972 5931283 3184721 16.692171 20.487896 ... 0.051724 0.068966 0.086207 0.034483 0.051724 0.000000 0.000000 0.034483 0.017241 0

5 rows × 39 columns

In [270]:
df.shape
Out[270]:
(51, 39)
In [271]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       51 non-null     int64  
 1   state_name               51 non-null     object 
 2   state_code               51 non-null     object 
 3   per_gop                  51 non-null     float64
 4   per_dem                  51 non-null     float64
 5   total_2016               51 non-null     int64  
 6   dem_2016                 51 non-null     int64  
 7   gop_2016                 51 non-null     int64  
 8   percent_no_highschool    51 non-null     float64
 9   percent_highschool       51 non-null     float64
 10  percent_college          51 non-null     float64
 11  percent_bachelor         51 non-null     float64
 12  percent_poverty          51 non-null     float64
 13  median_household_income  51 non-null     int64  
 14  unemployment_rate        51 non-null     float64
 15  Employed_2019            51 non-null     int64  
 16  Unemployed_2019          51 non-null     int64  
 17  ruc_1                    51 non-null     float64
 18  ruc_2                    51 non-null     float64
 19  ruc_3                    51 non-null     float64
 20  ruc_4                    51 non-null     float64
 21  ruc_5                    51 non-null     float64
 22  ruc_6                    51 non-null     float64
 23  ruc_7                    51 non-null     float64
 24  ruc_8                    51 non-null     float64
 25  ruc_9                    51 non-null     float64
 26  uic_1                    51 non-null     float64
 27  uic_2                    51 non-null     float64
 28  uic_3                    51 non-null     float64
 29  uic_4                    51 non-null     float64
 30  uic_5                    51 non-null     float64
 31  uic_6                    51 non-null     float64
 32  uic_7                    51 non-null     float64
 33  uic_8                    51 non-null     float64
 34  uic_9                    51 non-null     float64
 35  uic_10                   51 non-null     float64
 36  uic_11                   51 non-null     float64
 37  uic_12                   51 non-null     float64
 38  target                   51 non-null     int64  
dtypes: float64(29), int64(8), object(2)
memory usage: 15.7+ KB
In [272]:
df.describe()
Out[272]:
id per_gop per_dem total_2016 dem_2016 gop_2016 percent_no_highschool percent_highschool percent_college percent_bachelor ... uic_4 uic_5 uic_6 uic_7 uic_8 uic_9 uic_10 uic_11 uic_12 target
count 51.000000 51.000000 51.000000 5.100000e+01 5.100000e+01 5.100000e+01 51.000000 51.000000 51.000000 51.000000 ... 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000 51.000000
mean 28960.784314 0.586317 0.392727 2.495477e+06 1.193607e+06 1.180349e+06 10.461532 28.010589 29.757235 31.770644 ... 0.034489 0.073865 0.088338 0.042785 0.092530 0.049669 0.050380 0.045169 0.052457 0.784314
std 15832.827649 0.146841 0.145555 2.386844e+06 1.278957e+06 1.076152e+06 2.706207 4.173287 4.023961 6.427930 ... 0.037211 0.054218 0.079990 0.043547 0.087541 0.055531 0.066516 0.063557 0.084690 0.415390
min 1000.000000 0.053973 0.198169 0.000000e+00 0.000000e+00 0.000000e+00 6.449651 16.835115 15.547361 20.614605 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 16500.000000 0.503118 0.279586 7.360890e+05 2.703425e+05 3.713010e+05 8.253462 25.779302 27.592871 27.843889 ... 0.000000 0.038075 0.032998 0.000000 0.029963 0.000000 0.000000 0.000000 0.000000 1.000000
50% 29000.000000 0.602136 0.376106 1.922218e+06 7.795350e+05 9.479340e+05 9.796140 28.003729 29.736067 31.255686 ... 0.024096 0.071429 0.071429 0.034483 0.068966 0.031579 0.034483 0.027778 0.017241 1.000000
75% 41500.000000 0.693200 0.462139 3.088076e+06 1.680510e+06 1.545866e+06 12.696142 30.719540 32.605509 34.425982 ... 0.057821 0.106456 0.128571 0.058586 0.136111 0.078030 0.068966 0.069048 0.065979 1.000000
max 56000.000000 0.780742 0.921497 9.631972e+06 5.931283e+06 4.681590e+06 16.692171 40.320034 36.730377 58.540707 ... 0.150376 0.214286 0.375000 0.181818 0.400000 0.260870 0.303030 0.344828 0.379310 1.000000

8 rows × 37 columns

4.2. Distribution des votes pour les différents partis (Democrates et Republicains)¶

In [273]:
plt.figure(figsize=(12, 6))
sns.histplot(df['per_gop'], bins=30, color='blue', label='Votes Républicains', kde=True)
sns.histplot(df['per_dem'], bins=30, color='red', label='Votes Democrates', kde=True)
plt.title('Distribution des votes pour Démocrates et Republicains')
plt.xlabel('Nombre de Votes')
plt.ylabel('Fréquence')
plt.legend()
plt.show()
No description has been provided for this image

Sur ce graphique, on observe la distribution des pourcentages de vote par état pour les Républicains (bleu) et les Démocrates (rouge).

  • Les états à tendance démocrate affichent généralement des pourcentages entre 20% et 50%, avec un pic autour de 30%, tandis que les états républicains montrent des pourcentages plus élevés, entre 50% et 80%, avec une concentration maximale vers 70-75%.

On remarque donc qu'il y a peu d'états avec un équilibre proche de 50%, ce qui suggère une division politique assez nette entre les états américains.

Via un boxplot, essayons de voir s’il existe une différence significative dans les niveaux de chômage en fonction du parti qui a remporté le comté.

In [274]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='target', y='unemployment_rate', data=df)
plt.title('Boxplot de taux de chômage par parti politique')
plt.xlabel('Target Variable (0 = Démocrates, 1 = Républicains)')
plt.ylabel('Taux de chômage')
plt.show()
No description has been provided for this image

Ce boxplot compare les taux de chômage entre les états à majorité démocrate (0) et républicaine (1).

  • Les états républicains présentent une médiane légèrement inférieure mais une dispersion plus importante, avec des valeurs extrêmes plus élevées (jusqu'à 6%).
  • Les états démocrates montrent une distribution plus resserrée avec une médiane autour de 3.6%.

Bien que les deux groupes aient des taux de chômage globalement similaires, les républicains présentent à la fois des états avec les taux les plus bas et les plus élevés, suggérant une plus grande variabilité économique entre ces états.


4.3. Analyses Univariées¶

In [275]:
colors = {'dem': '#3333FF', 'gop': '#FF3333'}
In [276]:
# Création d'une figure pour regrouper les analyses univariées
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)

# 1. Variables politiques - Histogrammes
ax1 = fig.add_subplot(gs[0, 0])
sns.histplot(df['per_gop'], bins=30, color=colors['gop'], kde=True, ax=ax1)
ax1.set_title('Distribution des votes Républicains par état', fontsize=14)
ax1.set_xlabel('Pourcentage de votes Républicains (%)')
ax1.set_ylabel('Nombre d\'états')

ax2 = fig.add_subplot(gs[0, 1])
sns.histplot(df['per_dem'], bins=30, color=colors['dem'], kde=True, ax=ax2)
ax2.set_title('Distribution des votes Démocrates par état', fontsize=14)
ax2.set_xlabel('Pourcentage de votes Démocrates (%)')
ax2.set_ylabel('Nombre d\'états')
Out[276]:
Text(0, 0.5, "Nombre d'états")
In [277]:
# 2. Variables éducatives - Histogrammes
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)

ax3 = fig.add_subplot(gs[1, 0])
education_vars = ['percent_no_highschool', 'percent_highschool', 
                 'percent_college', 'percent_bachelor']
education_labels = ['Sans diplôme', 'Diplôme secondaire', 
                   'Études supérieures', 'Licence ou plus']
colors_edu = ['#FF9999', '#99FF99', '#9999FF', '#FFFF99']

for i, (var, label, color) in enumerate(zip(education_vars, education_labels, colors_edu)):
    sns.histplot(df[var], bins=20, kde=True, color=color, alpha=0.7, 
                label=label, ax=ax3)
ax3.set_title('Distribution des niveaux d\'éducation par état', fontsize=14)
ax3.set_xlabel('Pourcentage de la population (%)')
ax3.set_ylabel('Nombre d\'états')
ax3.legend()
Out[277]:
<matplotlib.legend.Legend at 0x7fdac5e76320>
In [278]:
# 3. Variables économiques - Histogrammes
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax4 = fig.add_subplot(gs[1, 1])
sns.histplot(df['unemployment_rate'], bins=20, kde=True, color='#66CCFF', ax=ax4)
ax4.set_title('Distribution du taux de chômage par état', fontsize=14)
ax4.set_xlabel('Taux de chômage (%)')
ax4.set_ylabel('Nombre d\'états')

ax5 = fig.add_subplot(gs[2, 0])
sns.histplot(df['percent_poverty'], bins=20, kde=True, color='#FF6666', ax=ax5)
ax5.set_title('Distribution du taux de pauvreté par état', fontsize=14)
ax5.set_xlabel('Taux de pauvreté (%)')
ax5.set_ylabel('Nombre d\'états')

ax6 = fig.add_subplot(gs[2, 1])
sns.histplot(df['median_household_income'], bins=20, kde=True, color='#66CC66', ax=ax6)
ax6.set_title('Distribution du revenu médian des ménages par état', fontsize=14)
ax6.set_xlabel('Revenu médian ($)')
ax6.set_ylabel('Nombre d\'états')
Out[278]:
Text(0, 0.5, "Nombre d'états")
In [279]:
# 4. Variables géographiques - Distribution des codes RUCC (Rural-Urban Continuum Code)
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax7 = fig.add_subplot(gs[3, 0])
ruc_columns = [col for col in df.columns if col.startswith('ruc_')]
ruc_means = df[ruc_columns].mean().sort_values(ascending=False)
ruc_categories = ['Métropolitain, >1M', 'Métropolitain, 250k-1M', 'Métropolitain, <250k',
                 'Urbain, >20k, adj. métro', 'Urbain, >20k, non-adj. métro',
                 'Urbain, 2.5k-20k, adj. métro', 'Urbain, 2.5k-20k, non-adj. métro',
                 'Rural, adj. métro', 'Rural, non-adj. métro']
colors_ruc = plt.cm.Spectral(np.linspace(0, 1, len(ruc_means)))

ax7.pie(ruc_means, labels=[f"RUC {i+1}" for i in range(len(ruc_means))], 
       autopct='%1.1f%%', startangle=90, colors=colors_ruc)
ax7.set_title('Distribution des codes Rural-Urban Continuum (RUC)', fontsize=14)
Out[279]:
Text(0.5, 1.0, 'Distribution des codes Rural-Urban Continuum (RUC)')
In [280]:
# 5. Variables géographiques - Distribution des codes UIC (Urban Influence Codes)
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax8 = fig.add_subplot(gs[3, 1])
uic_columns = [col for col in df.columns if col.startswith('uic_')]
uic_means = df[uic_columns].mean().sort_values(ascending=False)
colors_uic = plt.cm.tab20(np.linspace(0, 1, len(uic_means)))

ax8.pie(uic_means, labels=[f"UIC {i+1}" for i in range(len(uic_means))], 
       autopct='%1.1f%%', startangle=90, colors=colors_uic)
ax8.set_title('Distribution des codes Urban Influence (UIC)', fontsize=14)
Out[280]:
Text(0.5, 1.0, 'Distribution des codes Urban Influence (UIC)')

4.4. Analyses Bivariées¶

In [281]:
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax9 = fig.add_subplot(gs[4, 0])

# Utilisation correcte de `palette` dans `sns.boxplot`
sns.boxplot(
    x='target', 
    y='unemployment_rate', 
    data=df, 
    hue='target',
    legend=False, 
    ax=ax9
)

ax9.set_title('Taux de chômage par affiliation politique', fontsize=14)
ax9.set_xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
ax9.set_ylabel('Taux de chômage (%)')

plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [282]:
# 2. Boxplots des variables d'éducation par parti
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
education_fig, education_axes = plt.subplots(2, 2, figsize=(15, 12))
education_axes = education_axes.flatten()

for i, (var, label) in enumerate(zip(education_vars, education_labels)):
    sns.boxplot(x='target', y=var, data=df, hue='target', ax=education_axes[i])
    education_axes[i].set_title(f'{label} par affiliation politique', fontsize=12)
    education_axes[i].set_xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
    education_axes[i].set_ylabel(f'Pourcentage de {label.lower()} (%)')

education_fig.tight_layout()
fig.add_subplot(gs[4, 1])
plt.close(education_fig)
fig.add_subplot(gs[4, 1]).imshow(np.array(education_fig.canvas.renderer._renderer))
fig.add_subplot(gs[4, 1]).axis('off')
Out[282]:
(np.float64(0.0), np.float64(1.0), np.float64(0.0), np.float64(1.0))
In [283]:
# 3. Boxplots des variables économiques par parti
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax10 = fig.add_subplot(gs[5, 0])
sns.boxplot(x='target', y='percent_poverty', data=df, hue='target',
           palette={0: colors['dem'], 1: colors['gop']}, ax=ax10)
ax10.set_title('Taux de pauvreté par affiliation politique', fontsize=14)
ax10.set_xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
ax10.set_ylabel('Taux de pauvreté (%)')

ax11 = fig.add_subplot(gs[5, 1])
sns.boxplot(x='target', y='median_household_income',hue='target',palette={0: colors['dem'], 1: colors['gop']}, data=df, ax=ax11)
ax11.set_title('Revenu médian des ménages par affiliation politique', fontsize=14)
ax11.set_xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
ax11.set_ylabel('Revenu médian ($)')

fig.tight_layout(pad=3.0)
plt.show()
No description has been provided for this image
No description has been provided for this image
In [284]:
df.columns
Out[284]:
Index(['id', 'state_name', 'state_code', 'per_gop', 'per_dem', 'total_2016',
       'dem_2016', 'gop_2016', 'percent_no_highschool', 'percent_highschool',
       'percent_college', 'percent_bachelor', 'percent_poverty',
       'median_household_income', 'unemployment_rate', 'Employed_2019',
       'Unemployed_2019', 'ruc_1', 'ruc_2', 'ruc_3', 'ruc_4', 'ruc_5', 'ruc_6',
       'ruc_7', 'ruc_8', 'ruc_9', 'uic_1', 'uic_2', 'uic_3', 'uic_4', 'uic_5',
       'uic_6', 'uic_7', 'uic_8', 'uic_9', 'uic_10', 'uic_11', 'uic_12',
       'target'],
      dtype='object')
In [285]:
# 4. Matrices de corrélation (avec et sans variables socio-économiques)
socio_eco_vars = ['percent_no_highschool', 'percent_highschool', 
                 'percent_college', 'percent_bachelor', 'percent_poverty', 
                 'median_household_income', 'unemployment_rate']

df_corr = df[socio_eco_vars + ['target']]

# Calcul de la corrélation de Pearson
corr_matrix = df_corr.corr()

# Visualisation de la matrice de corrélation sous forme de heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Matrice de Corrélation entre les Variables(sociaux économiques) et la Cible')
plt.show()
No description has been provided for this image
In [286]:
plt.figure(figsize=(16, 12))
socio_eco_vars = ['percent_no_highschool', 'percent_highschool', 
                 'percent_college', 'percent_bachelor', 'percent_poverty', 
                 'median_household_income', 'unemployment_rate']

corr_matrix = df[socio_eco_vars].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)

sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
           square=True, linewidths=.5, annot=True, fmt=".2f")
plt.title('Matrice de corrélation des variables socio-économiques et politiques', fontsize=16)
plt.show()
No description has been provided for this image
In [287]:
# 5. Scatter plots des variables les plus corrélées avec l'affiliation politique
plt.figure(figsize=(18, 10))
potential_predictors = ['percent_no_highschool', 'percent_bachelor', 
                       'percent_poverty', 'median_household_income']

for i, var in enumerate(potential_predictors):
    plt.subplot(2, 2, i+1)
    sns.scatterplot(x=var, y='per_gop', data=df, color=colors['gop'], 
                   alpha=0.7, label='Républicains')
    sns.scatterplot(x=var, y='per_dem', data=df, color=colors['dem'], 
                   alpha=0.7, label='Démocrates')
    sns.regplot(x=var, y='per_gop', data=df, color=colors['gop'], 
               scatter=False, ci=None, line_kws={"linestyle": "--"})
    sns.regplot(x=var, y='per_dem', data=df, color=colors['dem'], 
               scatter=False, ci=None, line_kws={"linestyle": "--"})
    plt.title(f'Relation entre {var} et le pourcentage de votes', fontsize=12)
    plt.xlabel(var)
    plt.ylabel('Pourcentage de votes')
    plt.legend()

plt.tight_layout()
plt.show()
No description has been provided for this image
In [288]:
# 6. Analyse de la composition rurale/urbaine par parti politique (Regroupement des codes RUC (Rural Urban Codes))
# RUC 1-3: Zone métropolitaines
# RUC 4-7: Zone urbaine
# RUC 8-9: Zone rurale
# Créer des colonnes agrégées pour l'urbanité
df['urban_pct'] = df[['ruc_1', 'ruc_2', 'ruc_3']].sum(axis=1)  # Zones métropolitaines
df['semi_urban_pct'] = df[['ruc_4', 'ruc_5', 'ruc_6', 'ruc_7']].sum(axis=1)  # Zones urbaines
df['rural_pct'] = df[['ruc_8', 'ruc_9']].sum(axis=1)  # Zones rurales

plt.figure(figsize=(14, 7))
rural_vars = ['urban_pct', 'semi_urban_pct', 'rural_pct']
rural_labels = ['Métropolitain', 'Urbain', 'Rural']

for i, (var, label) in enumerate(zip(rural_vars, rural_labels)):
    plt.subplot(1, 3, i+1)
    sns.boxplot(x='target', y=var, data=df, hue='target', palette={0: colors['dem'], 1: colors['gop']})
    plt.title(f'Distribution {label} (RUC) par affiliation politique', fontsize=12)
    plt.xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
    plt.ylabel(f'Pourcentage {label} (%)')

plt.tight_layout()
plt.show()
No description has been provided for this image
In [289]:
# 7. Regroupement des codes UIC (Urban Influence Codes)
# UIC 1-2: Grands comtés métropolitains
# UIC 3-7: Comtés métropolitains de petite taille ou sous influence métropolitaine
# UIC 8-12: Comtés non-métropolitains ou ruraux

# Création des variables agrégées pour les UIC
df['large_metro_uic'] = df[['uic_1', 'uic_2']].sum(axis=1)  # Grands comtés métropolitains
df['small_metro_uic'] = df[['uic_3', 'uic_4', 'uic_5', 'uic_6', 'uic_7']].sum(axis=1)  # Petits comtés métropolitains ou sous influence
df['rural_uic'] = df[['uic_8', 'uic_9', 'uic_10', 'uic_11', 'uic_12']].sum(axis=1)  # Comtés ruraux

# Visualisation similaire à celle que vous avez faite pour les RUC
plt.figure(figsize=(14, 7))
uic_vars = ['large_metro_uic', 'small_metro_uic', 'rural_uic']
uic_labels = ['Grands Métropolitains', 'Petits Métropolitains', 'Ruraux']

for i, (var, label) in enumerate(zip(uic_vars, uic_labels)):
    plt.subplot(1, 3, i+1)
    sns.boxplot(x='target', y=var, data=df, hue='target', palette={0: colors['dem'], 1: colors['gop']})
    plt.title(f'{label} (UIC) par affiliation politique', fontsize=12)
    plt.xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
    plt.ylabel(f'Pourcentage {label} (%)')

plt.tight_layout()
plt.show()

# Ajoutez également ces variables à votre analyse de corrélation
# Pour voir leur relation avec la variable cible
all_vars = socio_eco_vars + rural_vars + uic_vars + ['target']
full_corr_matrix = df[all_vars].corr()

# Visualisation de la corrélation des nouvelles variables avec la cible
target_corr = full_corr_matrix['target'].sort_values(ascending=False)
print("Corrélation avec la variable cible:")
print(target_corr)
No description has been provided for this image
Corrélation avec la variable cible:
target                     1.000000
percent_college            0.571915
small_metro_uic            0.565129
semi_urban_pct             0.519181
rural_pct                  0.456753
percent_poverty            0.380079
rural_uic                  0.370999
percent_highschool         0.288327
percent_no_highschool      0.129814
unemployment_rate          0.090066
percent_bachelor          -0.599873
median_household_income   -0.659655
large_metro_uic           -0.663431
urban_pct                 -0.663431
Name: target, dtype: float64
In [290]:
# 8. Stacked bar chart de la composition rurale/urbaine par parti (RUC)
df_dem = df[df['target'] == 0]
df_rep = df[df['target'] == 1]

dem_rural_means = [df_dem['urban_pct'].mean(), df_dem['semi_urban_pct'].mean(), df_dem['rural_pct'].mean()]
rep_rural_means = [df_rep['urban_pct'].mean(), df_rep['semi_urban_pct'].mean(), df_rep['rural_pct'].mean()]

plt.figure(figsize=(10, 6))
width = 0.35
x = np.arange(len(rural_labels))

plt.bar(x - width/2, dem_rural_means, width, color=colors['dem'], alpha=0.7, label='Démocrates')
plt.bar(x + width/2, rep_rural_means, width, color=colors['gop'], alpha=0.7, label='Républicains')

plt.xlabel('Type de zone')
plt.ylabel('Pourcentage moyen (%)')
plt.title('Composition metropolitaine moyenne par affiliation politique (RUC)')
plt.xticks(x, rural_labels)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [291]:
# 9. Stacked bar chart de la composition urbaine par parti (UIC)
df_dem = df[df['target'] == 0]
df_rep = df[df['target'] == 1]

dem_urban_means = [df_dem['large_metro_uic'].mean(), df_dem['small_metro_uic'].mean(), df_dem['rural_uic'].mean()]
rep_urban_means = [df_rep['large_metro_uic'].mean(), df_rep['small_metro_uic'].mean(), df_rep['rural_uic'].mean()]

plt.figure(figsize=(10, 6))
width = 0.35
x = np.arange(len(uic_labels))

plt.bar(x - width/2, dem_urban_means, width, color=colors['dem'], alpha=0.7, label='Démocrates')
plt.bar(x + width/2, rep_urban_means, width, color=colors['gop'], alpha=0.7, label='Républicains')

plt.xlabel('Type de zone')
plt.ylabel('Pourcentage moyen (%)')
plt.title('Composition urbaine/rurale moyenne par affiliation politique (UIC)')
plt.xticks(x, uic_labels)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [292]:
# Ajoutez également ces variables à votre analyse de corrélation
# Pour voir leur relation avec la variable cible
all_vars = socio_eco_vars + rural_vars + uic_vars + ['target']
full_corr_matrix = df[all_vars].corr()

# Visualisation de la corrélation des nouvelles variables avec la cible
target_corr = full_corr_matrix['target'].sort_values(ascending=False)
print("Corrélation avec la variable cible:")
print(target_corr)
Corrélation avec la variable cible:
target                     1.000000
percent_college            0.571915
small_metro_uic            0.565129
semi_urban_pct             0.519181
rural_pct                  0.456753
percent_poverty            0.380079
rural_uic                  0.370999
percent_highschool         0.288327
percent_no_highschool      0.129814
unemployment_rate          0.090066
percent_bachelor          -0.599873
median_household_income   -0.659655
large_metro_uic           -0.663431
urban_pct                 -0.663431
Name: target, dtype: float64
In [293]:
# Visualisation de la matrice de corrélation sous forme de heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(full_corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Matrice de Corrélation entre les Variables et la Cible')
plt.show()
No description has been provided for this image
In [294]:
df['margin'] = (df['per_gop'] - df['per_dem'])*100
print(df)

# ---- Carte 1: Tendances électorales avec une échelle de couleur continue ----
fig1 = px.choropleth(
    df,
    locations='state_code',
    locationmode='USA-states',
    color='margin',  # Utiliser la marge au lieu d'une valeur binaire
    color_continuous_scale='RdBu_r',  # Rouge pour GOP, Bleu pour DEM
    range_color=[-30, 30],  # Limiter l'échelle pour mieux voir les différences
    scope='usa',
    title='Tendances électorales par état (écart GOP-DEM)',
    labels={'margin': 'Écart GOP-DEM (%)'}
)
fig1.update_layout(
    geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
    title_font_size=18,
    margin=dict(t=50, b=0, l=0, r=0)
)

# ---- Carte 2: Pourcentage rural avec une échelle de couleur continue ----
fig2 = px.choropleth(
    df,
    locations='state_code',
    locationmode='USA-states',
    color='rural_pct',
    color_continuous_scale='Greens',
    scope='usa',
    title='Pourcentage de zones rurales par état',
    labels={'rural_pct': '% Rural'}
)
fig2.update_layout(
    geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
    title_font_size=18,
    margin=dict(t=50, b=0, l=0, r=0)
)

# ---- Carte 3: Revenu médian des ménages ----
fig3 = px.choropleth(
    df,
    locations='state_code',
    locationmode='USA-states',
    color='median_household_income',
    color_continuous_scale='Viridis',
    scope='usa',
    title='Revenu médian des ménages par état',
    labels={'median_household_income': 'Revenu médian ($)'}
)
fig3.update_layout(
    geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
    title_font_size=18,
    margin=dict(t=50, b=0, l=0, r=0)
)

# ---- Carte 4: Taux de chômage ----
fig4 = px.choropleth(
    df,
    locations='state_code',
    locationmode='USA-states',
    color='unemployment_rate',
    color_continuous_scale='Reds',
    scope='usa',
    title='Taux de chômage par état',
    labels={'unemployment_rate': 'Taux de chômage (%)'}
)
fig4.update_layout(
    geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
    title_font_size=18,
    margin=dict(t=50, b=0, l=0, r=0)
)

# ---- Carte 5: Niveau d'éducation (pourcentage avec un diplôme universitaire) ----
fig5 = px.choropleth(
    df,
    locations='state_code',
    locationmode='USA-states',
    color='percent_bachelor',
    color_continuous_scale='Blues',
    scope='usa',
    title="Pourcentage de la population avec un diplôme universitaire",
    labels={'percent_bachelor': '% Diplôme universitaire'}
)
fig5.update_layout(
    geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
    title_font_size=18,
    margin=dict(t=50, b=0, l=0, r=0)
)

# ---- Carte 6: Taux de pauvreté ----
fig6 = px.choropleth(
    df,
    locations='state_code',
    locationmode='USA-states',
    color='percent_poverty',
    color_continuous_scale='OrRd',
    scope='usa',
    title="Taux de pauvreté par état",
    labels={'percent_poverty': '% Pauvreté'}
)
fig6.update_layout(
    geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
    title_font_size=18,
    margin=dict(t=50, b=0, l=0, r=0)
)

# ---- Scatter plot: Relation entre ruralité et vote républicain ----
fig7 = px.scatter(
    df, 
    x='rural_pct',
    y='per_gop',
    size='total_2016',
    color='margin',
    color_continuous_scale='RdBu_r',
    hover_name='state_name',
    title="Relation entre ruralité et vote républicain",
    labels={
        'rural_pct': '% Rural',
        'per_gop': '% Vote républicain',
        'total_2016': 'Total votes 2016',
        'margin': 'Marge GOP-DEM'
    },
    size_max=40
)
fig7.update_layout(
    title_font_size=18,
    xaxis_title_font_size=14,
    yaxis_title_font_size=14,
    coloraxis_colorbar_title_font_size=14
)

# ---- Scatter plot: Relation entre éducation et vote républicain ----
fig8 = px.scatter(
    df,
    x='percent_bachelor',
    y='per_gop',
    size='total_2016',
    color='median_household_income',
    color_continuous_scale='Viridis',
    hover_name='state_name',
    title="Relation entre niveau d'éducation et vote républicain",
    labels={
        'percent_bachelor': '% Diplôme universitaire',
        'per_gop': '% Vote républicain',
        'total_2016': 'Total votes 2016',
        'median_household_income': 'Revenu médian ($)'
    },
    size_max=40
)
fig8.update_layout(
    title_font_size=18,
    xaxis_title_font_size=14,
    yaxis_title_font_size=14,
    coloraxis_colorbar_title_font_size=14
)

# ---- Heatmap: Corrélation entre les variables ----
# Sélection des variables numériques pertinentes
cols_to_corr = ['per_gop', 'per_dem', 'margin', 'rural_pct', 'percent_no_highschool', 
                'percent_bachelor', 'percent_poverty', 'median_household_income', 
                'unemployment_rate']

# Calcul de la matrice de corrélation
corr_matrix = df[cols_to_corr].corr()

# Créer une heatmap de corrélation
fig9 = px.imshow(
    corr_matrix,
    text_auto='.2f',
    color_continuous_scale='RdBu_r',
    title="Matrice de corrélation entre les variables",
    labels=dict(x="Variables", y="Variables", color="Corrélation")
)
fig9.update_layout(
    title_font_size=18,
    xaxis_title_font_size=14,
    yaxis_title_font_size=14
)

# ---- Bar plot: Top 10 des états les plus républicains et démocrates ----
# On Créé un dataframe pour les 10 états les plus républicains et les 10 les plus démocrates
top_gop = df.sort_values('per_gop', ascending=False).head(10)
top_dem = df.sort_values('per_dem', ascending=False).head(10)

fig10 = make_subplots(rows=1, cols=2, subplot_titles=("Top 10 des états républicains", "Top 10 des états démocrates"))

fig10.add_trace(
    go.Bar(
        x=top_gop['state_name'],
        y=top_gop['per_gop'],
        marker_color='red',
        name='% Républicain'
    ),
    row=1, col=1
)

fig10.add_trace(
    go.Bar(
        x=top_dem['state_name'],
        y=top_dem['per_dem'],
        marker_color='blue',
        name='% Démocrate'
    ),
    row=1, col=2
)

fig10.update_layout(
    title_text="Top 10 des états par affiliation politique",
    title_font_size=18,
    showlegend=True,
    height=500
)

# Affichons tous les graphiques
for fig in [fig1, fig2, fig3, fig4, fig5, fig6, fig7, fig8, fig9, fig10]:
    fig.show()
       id            state_name state_code   per_gop   per_dem  total_2016  \
0    1000               Alabama         AL  0.647359  0.342648     2078165   
1    2000                Alaska         AK  0.497797  0.420912           0   
2    4000               Arizona         AZ  0.548723  0.435861     2062810   
3    5000              Arkansas         AR  0.688531  0.282032     1108615   
4    6000            California         CA  0.439389  0.537068     9631972   
5    8000              Colorado         CO  0.559502  0.417248     2564185   
6    9000           Connecticut         CT  0.424576  0.557866     1623542   
7   10000              Delaware         DE  0.443037  0.542737      441535   
8   11000  District of Columbia         DC  0.053973  0.921497      280272   
9   12000               Florida         FL  0.633620  0.357409     9386750   
10  13000               Georgia         GA  0.639809  0.350515     4029564   
11  15000                Hawaii         HI  0.330023  0.648358      428825   
12  16000                 Idaho         ID  0.730509  0.240739      688235   
13  17000              Illinois         IL  0.652193  0.327196     5374280   
14  18000               Indiana         IN  0.688717  0.291546     2722029   
15  19000                  Iowa         IA  0.638318  0.344197     1542880   
16  20000                Kansas         KS  0.752552  0.227409     1147143   
17  21000              Kentucky         KY  0.740423  0.245121     1922218   
18  22000             Louisiana         LA  0.646492  0.339012     2027731   
19  23000                 Maine         ME  0.486435  0.485294      741550   
20  24000              Maryland         MD  0.477075  0.498362     2474543   
21  25000         Massachusetts         MA  0.308799  0.668589     3231531   
22  26000              Michigan         MI  0.596681  0.387700     4789450   
23  27000             Minnesota         MN  0.602136  0.376106     2916404   
24  28000           Mississippi         MS  0.562834  0.423827     1162987   
25  29000              Missouri         MO  0.752084  0.232851     2775098   
26  30000               Montana         MT  0.689422  0.287197      483574   
27  31000              Nebraska         NE  0.780742  0.198169      805638   
28  32000                Nevada         NV  0.696978  0.277140     1122990   
29  33000         New Hampshire         NH  0.458564  0.523999      730628   
30  34000            New Jersey         NJ  0.437915  0.544962     3674893   
31  35000            New Mexico         NM  0.532221  0.447902      783127   
32  36000              New York         NY  0.508439  0.471869     7046175   
33  37000        North Carolina         NC  0.584579  0.403189     4629471   
34  38000          North Dakota         ND  0.725569  0.247834      336968   
35  39000                  Ohio         OH  0.674596  0.310536     5325395   
36  40000              Oklahoma         OK  0.778398  0.202755     1451056   
37  41000                Oregon         OR  0.566156  0.404151     1808575   
38  42000          Pennsylvania         PA  0.635927  0.350978     5970107   
39  44000          Rhode Island         RI  0.380590  0.598526      450121   
40  45000        South Carolina         SC  0.535671  0.452410     2084444   
41  46000          South Dakota         SD  0.673640  0.305517      370047   
42  47000             Tennessee         TN  0.747807  0.237236     2484691   
43  48000                 Texas         TX  0.743895  0.245202     8903237   
44  49000                  Utah         UT  0.728772  0.241405      852461   
45  50000               Vermont         VT  0.351706  0.615269      291413   
46  51000              Virginia         VA  0.551998  0.431626     3844787   
47  53000            Washington         WA  0.520402  0.448434     2765627   
48  54000         West Virginia         WV  0.741402  0.243346      708226   
49  55000             Wisconsin         WI  0.564259  0.419635     2944620   
50  56000               Wyoming         WY  0.750912  0.217684      248742   

    dem_2016  gop_2016  percent_no_highschool  percent_highschool  ...  \
0     718084   1306925              13.819302           30.800268  ...   
1          0         0               7.152934           28.003729  ...   
2     936250   1021154              12.860705           23.858877  ...   
3     378729    677904              13.430243           34.034885  ...   
4    5931283   3184721              16.692171           20.487896  ...   
5    1212209   1137455               8.253678           21.368059  ...   
6     884432    668266               9.369879           26.854712  ...   
7     235581    185103               9.982669           31.292805  ...   
8     260223     11553               9.076816           16.835115  ...   
9    4485745   4605515              11.810859           28.573500  ...   
10   1837300   2068623              12.855142           27.714716  ...   
11    266827    128815               8.028257           27.356155  ...   
12    189677    407199               9.226700           27.356852  ...   
13   2977498   2118179              10.787586           25.954943  ...   
14   1031953   1556220              11.181375           33.406757  ...   
15    650790    798923               7.908792           30.982542  ...   
16    414788    656009               9.048409           25.906137  ...   
17    628834   1202942              13.738696           32.893917  ...   
18    779535   1178004              14.773869           33.962753  ...   
19    354873    334838               7.389770           31.473530  ...   
20   1497951    873646               9.796140           24.611042  ...   
21   1964768   1083069               9.242436           24.019262  ...   
22   2267373   2279210               9.190457           28.873878  ...   
23   1366676   1322891               6.859513           24.647589  ...   
24    462001    678457              15.493731           30.438028  ...   
25   1054889   1585753              10.078580           30.617037  ...   
26    174521    274120               6.449651           28.832081  ...   
27    273858    485819               8.595745           26.106092  ...   
28    537753    511319              13.309174           28.085070  ...   
29    348126    345598               6.894038           27.419645  ...   
30   2021756   1535513              10.183736           27.185795  ...   
31    380724    315875              14.411727           26.460430  ...   
32   4143874   2640570              13.179301           25.977776  ...   
33   2162074   2339603              12.219548           25.652466  ...   
34     93526    216133               7.351314           26.429096  ...   
35   2317001   2771984               9.621357           33.037495  ...   
36    419788    947934              11.976947           31.330032  ...   
37    934631    742506               9.287846           22.735300  ...   
38   2844705   2912941               9.480545           34.693886  ...   
39    249902    179421              11.186517           28.267363  ...   
40    849469   1143611              12.488965           29.103146  ...   
41    117442    227701               8.253245           30.237791  ...   
42    867110   1517402              12.537143           32.088009  ...   
43   3867816   4681590              16.313875           24.957039  ...   
44    237241    397004               7.719078           22.836246  ...   
45    178179     95053               7.327794           28.795351  ...   
46   1916845   1731156              10.305691           23.953545  ...   
47   1523720   1043648               8.672709           21.999466  ...   
48    187457    486198              13.097753           40.320034  ...   
49   1382210   1409467               7.791683           30.638811  ...   
50     55949    174248               6.834035           29.073072  ...   

      uic_11    uic_12  target  urban_pct  semi_urban_pct  rural_pct  \
0   0.074627  0.000000       1   0.432836        0.402985   0.164179   
1   0.344828  0.379310       1   0.103448        0.310345   0.586207   
2   0.000000  0.000000       1   0.533333        0.466667   0.000000   
3   0.026667  0.000000       1   0.266667        0.560000   0.173333   
4   0.034483  0.017241       0   0.637931        0.293103   0.068966   
5   0.125000  0.171875       1   0.265625        0.421875   0.312500   
6   0.000000  0.000000       0   0.875000        0.125000   0.000000   
7   0.000000  0.000000       0   1.000000        0.000000   0.000000   
8   0.000000  0.000000       0   1.000000        0.000000   0.000000   
9   0.000000  0.014925       1   0.656716        0.313433   0.029851   
10  0.012579  0.025157       1   0.465409        0.396226   0.138365   
11  0.000000  0.000000       0   0.600000        0.400000   0.000000   
12  0.045455  0.045455       1   0.272727        0.500000   0.227273   
13  0.019608  0.019608       1   0.392157        0.509804   0.098039   
14  0.000000  0.000000       1   0.478261        0.467391   0.054348   
15  0.050505  0.040404       1   0.212121        0.585859   0.202020   
16  0.066667  0.209524       1   0.180952        0.419048   0.400000   
17  0.041667  0.075000       1   0.291667        0.408333   0.300000   
18  0.031250  0.000000       1   0.546875        0.375000   0.078125   
19  0.187500  0.000000       1   0.312500        0.562500   0.125000   
20  0.000000  0.000000       0   0.791667        0.208333   0.000000   
21  0.071429  0.000000       0   0.785714        0.214286   0.000000   
22  0.108434  0.024096       1   0.313253        0.518072   0.168675   
23  0.034483  0.068966       1   0.310345        0.471264   0.218391   
24  0.000000  0.000000       1   0.207317        0.536585   0.256098   
25  0.034783  0.017391       1   0.295652        0.443478   0.260870   
26  0.160714  0.285714       1   0.089286        0.392857   0.517857   
27  0.086022  0.204301       1   0.139785        0.311828   0.548387   
28  0.117647  0.000000       1   0.235294        0.529412   0.235294   
29  0.000000  0.000000       0   0.300000        0.700000   0.000000   
30  0.000000  0.000000       0   1.000000        0.000000   0.000000   
31  0.090909  0.060606       1   0.212121        0.606061   0.181818   
32  0.000000  0.000000       1   0.612903        0.370968   0.016129   
33  0.010000  0.030000       1   0.460000        0.380000   0.160000   
34  0.037736  0.264151       1   0.113208        0.188679   0.698113   
35  0.000000  0.000000       1   0.431818        0.545455   0.022727   
36  0.012987  0.012987       1   0.233766        0.558442   0.207792   
37  0.055556  0.083333       1   0.361111        0.500000   0.138889   
38  0.000000  0.000000       1   0.552239        0.388060   0.059701   
39  0.000000  0.000000       0   1.000000        0.000000   0.000000   
40  0.000000  0.000000       1   0.565217        0.413043   0.021739   
41  0.030303  0.166667       1   0.121212        0.242424   0.636364   
42  0.000000  0.010526       1   0.442105        0.389474   0.168421   
43  0.051181  0.062992       1   0.322835        0.484252   0.192913   
44  0.103448  0.103448       1   0.344828        0.482759   0.172414   
45  0.071429  0.000000       0   0.214286        0.571429   0.214286   
46  0.007519  0.037594       1   0.601504        0.240602   0.157895   
47  0.000000  0.051282       1   0.538462        0.333333   0.128205   
48  0.000000  0.036364       1   0.381818        0.418182   0.200000   
49  0.027778  0.069444       1   0.361111        0.458333   0.180556   
50  0.130435  0.086957       1   0.086957        0.739130   0.173913   

    large_metro_uic  small_metro_uic  rural_uic     margin  
0          0.432836         0.447761   0.119403  30.471133  
1          0.103448         0.034483   0.862069   7.688521  
2          0.533333         0.333333   0.133333  11.286270  
3          0.266667         0.360000   0.373333  40.649831  
4          0.637931         0.258621   0.103448  -9.767912  
5          0.265625         0.187500   0.546875  14.225424  
6          0.875000         0.125000   0.000000 -13.329063  
7          1.000000         0.000000   0.000000  -9.969955  
8          1.000000         0.000000   0.000000 -86.752373  
9          0.656716         0.328358   0.014925  27.621053  
10         0.465409         0.364780   0.169811  28.929338  
11         0.600000         0.000000   0.400000 -31.833484  
12         0.272727         0.409091   0.318182  48.976967  
13         0.392157         0.362745   0.245098  32.499742  
14         0.478261         0.445652   0.076087  39.717157  
15         0.212121         0.383838   0.404040  29.412134  
16         0.180952         0.200000   0.619048  52.514368  
17         0.291667         0.266667   0.441667  49.530237  
18         0.546875         0.328125   0.125000  30.747987  
19         0.312500         0.500000   0.187500   0.114054  
20         0.791667         0.208333   0.000000  -2.128640  
21         0.785714         0.071429   0.142857 -35.979075  
22         0.313253         0.204819   0.481928  20.898059  
23         0.310345         0.390805   0.298851  22.602929  
24         0.207317         0.365854   0.426829  13.900687  
25         0.295652         0.356522   0.347826  51.923302  
26         0.089286         0.232143   0.678571  40.222544  
27         0.139785         0.193548   0.666667  58.257352  
28         0.235294         0.294118   0.470588  41.983823  
29         0.300000         0.400000   0.300000  -6.543491  
30         1.000000         0.000000   0.000000 -10.704621  
31         0.212121         0.242424   0.545455   8.431905  
32         0.612903         0.322581   0.064516   3.657005  
33         0.460000         0.400000   0.140000  18.138959  
34         0.113208         0.207547   0.679245  47.773558  
35         0.431818         0.500000   0.068182  36.405993  
36         0.233766         0.324675   0.441558  57.564274  
37         0.361111         0.277778   0.361111  16.200463  
38         0.552239         0.343284   0.104478  28.494876  
39         1.000000         0.000000   0.000000 -21.793601  
40         0.565217         0.413043   0.021739   8.326079  
41         0.121212         0.212121   0.666667  36.812326  
42         0.442105         0.410526   0.147368  51.057099  
43         0.322835         0.385827   0.291339  49.869359  
44         0.344828         0.206897   0.448276  48.736691  
45         0.214286         0.357143   0.428571 -26.356277  
46         0.601504         0.285714   0.112782  12.037252  
47         0.538462         0.333333   0.128205   7.196842  
48         0.381818         0.400000   0.218182  49.805581  
49         0.361111         0.486111   0.152778  14.462402  
50         0.086957         0.086957   0.826087  53.322846  

[51 rows x 46 columns]

On voit maintenant que c'est plus claire


Voici les variables à utiliser pour notre modèle, basées sur l'analyse des corrélations:

Pertinent:

  1. percent_college (0.57) - Forte corrélation positive avec la cible
  2. small_metro_uic (0.57) - Forte corrélation positive avec la cible
  3. semi_urban_pct (0.52) - Bonne corrélation positive avec la cible
  4. median_household_income (-0.66) - Forte corrélation négative avec la cible
  5. percent_poverty (0.38) - Corrélation modérée positive avec la cible

Possible:

  1. rural_pct (0.46) - Bonne corrélation positive avec la cible
  2. percent_bachelor (-0.60) - Forte corrélation négative avec la cible

Cette sélection nous permettra de capturer les différentes dimensions qui influencent votre variable cible tout en limitant la multicolinéarité. Elle combine des indicateurs socio-économiques (revenus, éducation, pauvreté) et des indicateurs géographiques (degré d'urbanisation).

5. MODELISATION¶

5.1. CheckUps¶

Checking de la repartition des votes

In [295]:
df["target"].value_counts(normalize=True)
Out[295]:
target
1    0.784314
0    0.215686
Name: proportion, dtype: float64

Observation : Le dataset est très déséquilibré :

  • 78.4% des états ont voté Républicain (1)
  • 21.6% ont voté Démocrate (0)
Cela peut poser problème, car un modèle de classification risque de favoriser la classe majoritaire et mal prédire les Démocrates.¶

5.2. Séparation features et target¶

In [296]:
df.columns
Out[296]:
Index(['id', 'state_name', 'state_code', 'per_gop', 'per_dem', 'total_2016',
       'dem_2016', 'gop_2016', 'percent_no_highschool', 'percent_highschool',
       'percent_college', 'percent_bachelor', 'percent_poverty',
       'median_household_income', 'unemployment_rate', 'Employed_2019',
       'Unemployed_2019', 'ruc_1', 'ruc_2', 'ruc_3', 'ruc_4', 'ruc_5', 'ruc_6',
       'ruc_7', 'ruc_8', 'ruc_9', 'uic_1', 'uic_2', 'uic_3', 'uic_4', 'uic_5',
       'uic_6', 'uic_7', 'uic_8', 'uic_9', 'uic_10', 'uic_11', 'uic_12',
       'target', 'urban_pct', 'semi_urban_pct', 'rural_pct', 'large_metro_uic',
       'small_metro_uic', 'rural_uic', 'margin'],
      dtype='object')
In [297]:
# Liste des colonnes pertinentes
selected_columns = [
    "id","state_name", "state_code",
    "percent_college", "semi_urban_pct", "median_household_income", 
    "percent_poverty", "rural_pct", "percent_bachelor"
]
In [298]:
selected_features = [col for col in selected_columns if col in df.columns]

# Création des jeux de données
X = df[selected_features]
y = df["target"]

5.3. Encodage des variables catégorielles¶

In [299]:
categorical_features = ["id","state_name", "state_code"]
encoder = OneHotEncoder(drop="first", sparse_output=False)
X_encoded = pd.DataFrame(encoder.fit_transform(X[categorical_features]))
In [300]:
X_encoded.columns = encoder.get_feature_names_out(categorical_features)
X = X.drop(columns=categorical_features)
In [301]:
# Concaténation les features encodées avec les autres features
X_final = pd.concat([X, X_encoded], axis=1)
X_final.head()
Out[301]:
percent_college semi_urban_pct median_household_income percent_poverty rural_pct percent_bachelor id_2000 id_4000 id_5000 id_6000 ... state_code_SD state_code_TN state_code_TX state_code_UT state_code_VA state_code_VT state_code_WA state_code_WI state_code_WV state_code_WY
0 29.912098 0.402985 51771 15.6 0.164179 25.468332 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 35.292122 0.310345 77203 10.2 0.586207 29.551214 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 33.813610 0.466667 62027 13.5 0.000000 29.466806 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 29.507084 0.560000 49020 16.0 0.173333 23.027790 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 28.893970 0.293103 80423 11.8 0.068966 33.925964 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 156 columns


5.4. Modélisation proprement dite¶

In [302]:
def model_comparison(X, y):
    # Séparation des données en train et test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # List of models to evaluate
    models = {
        'Logistic Regression': LogisticRegression(max_iter=5000, random_state=42),
        'Random Forest': RandomForestClassifier(random_state=42),
        'XGBoost': xgb.XGBClassifier(random_state=42)
    }

    # Paramètres à tester dans GridSearchCV pour chaque modèle
    param_grids = {
        'Logistic Regression': {
            'logreg__C': [0.1, 1, 10],
            'logreg__solver': ['lbfgs', 'liblinear']
        },
        'Random Forest': {
            'rf__n_estimators': [100, 200, 500],
            'rf__max_depth': [10, 20, None],
            'rf__min_samples_split': [2, 5],
            'rf__min_samples_leaf': [1, 2]
        },
        'XGBoost': {
            'xgb__learning_rate': [0.01, 0.1, 0.3],
            'xgb__max_depth': [3, 6, 10],
            'xgb__n_estimators': [50, 100, 200],
            'xgb__subsample': [0.8, 1.0],
            'xgb__colsample_bytree': [0.8, 1.0]
        }
    }
    
    results = []
    
    for model_name, model in models.items():
        # Création d'un pipeline pour chaque modèle
        if model_name == 'Logistic Regression':
            pipeline = Pipeline([('scaler', StandardScaler()), ('logreg', model)])
        elif model_name == 'Random Forest':
            pipeline = Pipeline([('scaler', StandardScaler()), ('rf', model)])
        else:  # XGBoost
            pipeline = Pipeline([('scaler', StandardScaler()), ('xgb', model)])

        # GridSearchCV
        grid_search = GridSearchCV(pipeline, param_grids[model_name], cv=5, scoring='f1', n_jobs=-1)
        
        # Entraînement et évaluation
        grid_search.fit(X_train, y_train)
        y_pred_train = grid_search.predict(X_train)
        y_pred_test = grid_search.predict(X_test)
        
        # Collecte des résultats
        results.append({
            'Model': model_name,
            'Best Params': grid_search.best_params_,
            'Train F1-Score': classification_report(y_train, y_pred_train, output_dict=True)['1']['f1-score'],
            'Test F1-Score': classification_report(y_test, y_pred_test, output_dict=True)['1']['f1-score'],
            'Train Accuracy': classification_report(y_train, y_pred_train, output_dict=True)['accuracy'],
            'Test Accuracy': classification_report(y_test, y_pred_test, output_dict=True)['accuracy'],
            'Train Recall': classification_report(y_train, y_pred_train, output_dict=True)['1']['recall'],
            'Test Recall': classification_report(y_test, y_pred_test, output_dict=True)['1']['recall'],
            'Train Precision': classification_report(y_train, y_pred_train, output_dict=True)['1']['precision'],
            'Test Precision': classification_report(y_test, y_pred_test, output_dict=True)['1']['precision']
        })
    
    # Conversion des résultats en DataFrame
    results_df = pd.DataFrame(results)
    
    return results_df

def plot_model_metrics(results_df):
    """
    Function to plot test metrics from model comparison results
    """
    models = results_df['Model']
    test_accuracy = results_df['Test Accuracy']
    test_f1 = results_df['Test F1-Score']
    test_recall = results_df['Test Recall']
    
    # Couleurs harmonieuses et moins contrastées
    colors = ['#8ecae6', '#219ebc', '#126782']
    
    x = np.arange(len(models))
    width = 0.25  # Plus étroit pour accommoder 3 barres
    
    plt.figure(figsize=(12, 7))
    
    # Création des barres avec les nouvelles couleurs
    plt.bar(x - width, test_accuracy, width, label='Test Accuracy', color=colors[0], alpha=0.8)
    plt.bar(x, test_f1, width, label='Test F1-Score', color=colors[1], alpha=0.8)
    plt.bar(x + width, test_recall, width, label='Test Recall', color=colors[2], alpha=0.8)
    
    plt.xlabel('Modèles', fontsize=12)
    plt.ylabel('Score', fontsize=12)
    plt.title('Comparaison des métriques de test par modèle', fontsize=14)
    plt.xticks(x, models, rotation=15, ha='right')
    plt.ylim(0, 1.1)
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.3)
    
    # Ajout des valeurs sur les barres
    for i, v in enumerate(test_accuracy):
        plt.text(i - width, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
    for i, v in enumerate(test_f1):
        plt.text(i, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
    for i, v in enumerate(test_recall):
        plt.text(i + width, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
    
    plt.tight_layout()
    plt.show()
In [303]:
def plot_accuracy_comparison(results_df):
    """
    Function to compare train and test accuracy for each model
    """
    models = results_df['Model']
    train_accuracy = results_df['Train Accuracy']
    test_accuracy = results_df['Test Accuracy']
    
    # Couleurs harmonieuses
    colors = ['#219ebc', '#fb8500']
    
    x = np.arange(len(models))
    width = 0.35
    
    plt.figure(figsize=(12, 7))
    
    plt.bar(x - width/2, train_accuracy, width, label='Train Accuracy', color=colors[0], alpha=0.8)
    plt.bar(x + width/2, test_accuracy, width, label='Test Accuracy', color=colors[1], alpha=0.8)
    
    plt.xlabel('Modèles', fontsize=12)
    plt.ylabel('Accuracy', fontsize=12)
    plt.title('Comparaison de l\'accuracy en train et test par modèle', fontsize=14)
    plt.xticks(x, models, rotation=15, ha='right')
    plt.ylim(0, 1.1)
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.3)
    
    # Ajout des valeurs sur les barres
    for i, v in enumerate(train_accuracy):
        plt.text(i - width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
    for i, v in enumerate(test_accuracy):
        plt.text(i + width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
    
    # Calcul et affichage de la différence entre train et test (overfitting)
    for i in range(len(models)):
        diff = train_accuracy[i] - test_accuracy[i]
        plt.text(i, max(train_accuracy[i], test_accuracy[i]) + 0.08, 
                 f'Diff: {diff:.2f}', ha='center', fontsize=10, color='#d62828')
    
    plt.tight_layout()
    plt.show()

def plot_recall_comparison(results_df):
    """
    Function to compare train and test recall for each model
    """
    models = results_df['Model']
    train_recall = results_df['Train Recall']
    test_recall = results_df['Test Recall']
    
    # Couleurs harmonieuses
    colors = ['#4cc9f0', '#f72585']
    
    x = np.arange(len(models))
    width = 0.35
    
    plt.figure(figsize=(12, 7))
    
    plt.bar(x - width/2, train_recall, width, label='Train Recall', color=colors[0], alpha=0.8)
    plt.bar(x + width/2, test_recall, width, label='Test Recall', color=colors[1], alpha=0.8)
    
    plt.xlabel('Modèles', fontsize=12)
    plt.ylabel('Recall', fontsize=12)
    plt.title('Comparaison du recall en train et test par modèle', fontsize=14)
    plt.xticks(x, models, rotation=15, ha='right')
    plt.ylim(0, 1.1)
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.3)
    
    # Ajout des valeurs sur les barres
    for i, v in enumerate(train_recall):
        plt.text(i - width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
    for i, v in enumerate(test_recall):
        plt.text(i + width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
    
    # Calcul et affichage de la différence entre train et test
    for i in range(len(models)):
        diff = train_recall[i] - test_recall[i]
        plt.text(i, max(train_recall[i], test_recall[i]) + 0.08, 
                 f'Diff: {diff:.2f}', ha='center', fontsize=10, color='#d62828')
    
    plt.tight_layout()
    plt.show()

def plot_f1_comparison(results_df):
    """
    Function to compare train and test F1-Score for each model
    """
    models = results_df['Model']
    train_f1 = results_df['Train F1-Score']
    test_f1 = results_df['Test F1-Score']
    
    # Couleurs harmonieuses
    colors = ['#2a9d8f', '#e76f51']
    
    x = np.arange(len(models))
    width = 0.35
    
    plt.figure(figsize=(12, 7))
    
    plt.bar(x - width/2, train_f1, width, label='Train F1-Score', color=colors[0], alpha=0.8)
    plt.bar(x + width/2, test_f1, width, label='Test F1-Score', color=colors[1], alpha=0.8)
    
    plt.xlabel('Modèles', fontsize=12)
    plt.ylabel('F1-Score', fontsize=12)
    plt.title('Comparaison du F1-Score en train et test par modèle', fontsize=14)
    plt.xticks(x, models, rotation=15, ha='right')
    plt.ylim(0, 1.1)
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.3)
    
    # Ajout des valeurs sur les barres
    for i, v in enumerate(train_f1):
        plt.text(i - width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
    for i, v in enumerate(test_f1):
        plt.text(i + width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
    
    # Calcul et affichage de la différence entre train et test
    for i in range(len(models)):
        diff = train_f1[i] - test_f1[i]
        plt.text(i, max(train_f1[i], test_f1[i]) + 0.08, 
                 f'Diff: {diff:.2f}', ha='center', fontsize=10, color='#d62828')
    
    plt.tight_layout()
    plt.show()
In [304]:
# Entrainement des modèles
results = model_comparison(X_final, y)
results
Out[304]:
Model Best Params Train F1-Score Test F1-Score Train Accuracy Test Accuracy Train Recall Test Recall Train Precision Test Precision
0 Logistic Regression {'logreg__C': 0.1, 'logreg__solver': 'liblinear'} 1.0 0.909091 1.0 0.8750 1.0 0.833333 1.0 1.000000
1 Random Forest {'rf__max_depth': 10, 'rf__min_samples_leaf': ... 1.0 0.960000 1.0 0.9375 1.0 1.000000 1.0 0.923077
2 XGBoost {'xgb__colsample_bytree': 0.8, 'xgb__learning_... 1.0 0.869565 1.0 0.8125 1.0 0.833333 1.0 0.909091

6. EVALUATION¶

In [305]:
# Affichage des resultats
plot_model_metrics(results)
No description has been provided for this image

Analyse comparative des modèles

  • Random Forest se démarque avec des performances supérieures sur toutes les métriques (Recall: 1.00, F1-Score: 0.96, Accuracy: 0.94).
  • Logistic Regression présente des résultats intermédiaires, tandis que
  • XGBoost affiche des performances légèrement moindres.
    Pour ce problème, Random Forest offre le meilleur équilibre entre les différentes métriques.
In [306]:
plot_accuracy_comparison(results)
plot_recall_comparison(results)
plot_f1_comparison(results)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Analyse de la capacité de généralisation des modèles¶

Nous avons entraîné trois modèles de classification différents (Régression Logistique, Random Forest et XGBoost) et comparé leurs performances sur les jeux de données d'entraînement et de test. Cette comparaison nous permet d'évaluer leur capacité de généralisation.

Résultats observés¶

Accuracy¶
  • Tous les modèles atteignent une accuracy parfaite (1.00) sur les données d'entraînement
  • Sur les données de test:
    • Random Forest: 0.94 (Diff: 0.06)
    • Logistic Regression: 0.88 (Diff: 0.12)
    • XGBoost: 0.81 (Diff: 0.19)
Recall¶
  • Tous les modèles obtiennent un recall parfait (1.00) sur les données d'entraînement
  • Sur les données de test:
    • Random Forest: 1.00 (Diff: 0.00)
    • Logistic Regression: 0.83 (Diff: 0.17)
    • XGBoost: 0.83 (Diff: 0.17)
F1-Score¶
  • Tous les modèles atteignent un F1-Score parfait (1.00) sur les données d'entraînement
  • Sur les données de test:
    • Random Forest: 0.96 (Diff: 0.04)
    • Logistic Regression: 0.91 (Diff: 0.09)
    • XGBoost: 0.87 (Diff: 0.13)

Interprétation¶

Les graphiques de comparaison train/test nous permettent d'identifier clairement la capacité de généralisation de chaque modèle:

  1. Random Forest présente la meilleure capacité de généralisation:

    • Performances élevées sur toutes les métriques en test
    • Écarts minimaux entre train et test
    • Capacité exceptionnelle à maintenir un recall parfait sur les données de test
  2. Logistic Regression montre une généralisation satisfaisante:

    • Performances correctes en test
    • Écarts modérés entre train et test
    • Bon équilibre entre precision et recall (bon F1-Score)
  3. XGBoost présente des signes de surapprentissage:

    • Performances plus faibles en test comparées aux autres modèles
    • Écarts plus importants entre train et test
    • Pourrait bénéficier d'une meilleure régularisation

On note que le recall, le score f1 ou encore l'accuracy en train on eut des scrore parfait ce qui peut etre problématique et peut faire penser à un overfitting (surapprentissage)

Conclusion¶

La visualisation systématique des performances sur les jeux d'entraînement et de test nous permet de justifier objectivement le choix du modèle Random Forest comme étant le plus fiable pour la généralisation à de nouvelles données. Ce modèle offre le meilleur compromis entre performance prédictive et stabilité entre les différents jeux de données.

Procédons l'analyse des parametres pour voir les plus importantes¶

In [309]:
def analyze_feature_importance(X, y, model_results):
    """
    Analyse l'importance des variables de façon globale et locale avec SHAP
    Supporte les modèles binaires et multi-classes
    
    Parameters:
    -----------
    X : DataFrame
        Les features utilisées pour l'entraînement
    y : Series
        La variable cible
    model_results : DataFrame
        Résultats de la fonction model_comparison
        
    Returns:
    --------
    dict: Dictionnaire contenant les résultats de l'analyse
    """
    
    # Récupération du meilleur modèle (Random Forest)
    best_params = model_results.loc[model_results['Model'] == 'Random Forest', 'Best Params'].values[0]
    
    # Extraction des paramètres optimaux pour Random Forest
    params = {}
    for key, value in best_params.items():
        if key.startswith('rf__'):
            params[key[4:]] = value  # Enlever le préfixe 'rf__'
    
    # Création et entraînement du modèle avec les meilleurs paramètres
    best_rf = RandomForestClassifier(**params, random_state=42)
    
    # Création et entraînement du modèle de régression logistique pour comparer
    best_params_logreg = model_results.loc[model_results['Model'] == 'Logistic Regression', 'Best Params'].values[0]
    params_logreg = {}
    for key, value in best_params_logreg.items():
        if key.startswith('logreg__'):
            params_logreg[key[8:]] = value  # Enlever le préfixe 'logreg__'
    
    best_logreg = LogisticRegression(**params_logreg, random_state=42)
    
    # Standardisation des données pour la régression logistique
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Entraînement des modèles
    best_rf.fit(X, y)
    best_logreg.fit(X_scaled, y)
    
    # 1. Analyse globale - Feature Importance pour Random Forest
    plt.figure(figsize=(10, 6))
    feature_importances = pd.DataFrame(
        {'feature': X.columns, 'importance': best_rf.feature_importances_}
    ).sort_values('importance', ascending=False)
    
    sns.barplot(x='importance', y='feature', data=feature_importances, palette='Blues_d')
    plt.title('Importance des variables - Random Forest', fontsize=14)
    plt.xlabel('Importance')
    plt.ylabel('Variables')
    plt.tight_layout()
    plt.show()
    
    # 2. Analyse globale - Coefficients pour Régression Logistique
    plt.figure(figsize=(10, 6))
    coefs = pd.DataFrame(
        {'feature': X.columns, 'coefficient': best_logreg.coef_[0]}
    ).sort_values('coefficient', ascending=False)
    
    # Utiliser une palette de couleurs différentes pour les coefficients positifs et négatifs
    colors = ['#FF9999' if c < 0 else '#66B2FF' for c in coefs['coefficient']]
    
    sns.barplot(x='coefficient', y='feature', data=coefs, palette=colors)
    plt.title('Coefficients - Régression Logistique', fontsize=14)
    plt.axvline(x=0, color='gray', linestyle='--')
    plt.xlabel('Coefficient')
    plt.ylabel('Variables')
    plt.tight_layout()
    plt.show()
    
    # 3. Analyse locale - SHAP pour Random Forest
    # Échantillonnage si le dataset est très grand
    sample_size = min(100, len(X))
    X_sample = X.sample(sample_size, random_state=42)
    
    # Création de l'explainer SHAP
    explainer = shap.TreeExplainer(best_rf)
    
    # Pour les graphiques summary plot, on peut encore utiliser shap_values
    shap_values = explainer.shap_values(X_sample)
    
    # Vérifier si nous avons un modèle binaire
    is_binary = isinstance(shap_values, list) or (isinstance(shap_values, np.ndarray) and shap_values.ndim > 2)
    
    # Résumé des valeurs SHAP - utiliser la classe 1 pour classification binaire
    plt.figure(figsize=(10, 8))
    if is_binary:
        # Pour les modèles binaires, utiliser l'index 1 (classe positive)
        shap.summary_plot(shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1], 
                          X_sample, plot_type="bar", show=False)
    else:
        shap.summary_plot(shap_values, X_sample, plot_type="bar", show=False)
    
    plt.title('Résumé des valeurs SHAP - Random Forest', fontsize=14)
    plt.tight_layout()
    plt.show()
    
    # Détail des valeurs SHAP
    plt.figure(figsize=(12, 10))
    if is_binary:
        # Pour les modèles binaires, utiliser l'index 1 (classe positive)
        shap.summary_plot(shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1], 
                          X_sample, show=False)
    else:
        shap.summary_plot(shap_values, X_sample, show=False)
    
    plt.title('Impact des variables sur les prédictions (SHAP) - Random Forest', fontsize=14)
    plt.tight_layout()
    plt.show()
    
    # 4. Analyse d'un exemple spécifique
    try:
        # Prendre un exemple au hasard
        example_idx = np.random.randint(0, len(X_sample))
        
        print(f"Analyse d'un exemple spécifique (indice {example_idx}):")
        
        # Afficher les valeurs de l'exemple
        example_data = X_sample.iloc[example_idx]
        print("\nValeurs de l'exemple:")
        for feature, value in example_data.items():
            print(f"  {feature}: {value}")
        
        # Tenter d'utiliser le SHAP Dependence plot au lieu du waterfall
        # qui est plus robuste pour les modèles binaires
        plt.figure(figsize=(10, 6))
        
        # Identifier la feature la plus importante
        if is_binary:
            feature_importance = np.abs(shap_values[1]).mean(0) if isinstance(shap_values, list) else np.abs(shap_values[:, :, 1]).mean(0)
        else:
            feature_importance = np.abs(shap_values).mean(0)
            
        most_important_idx = np.argmax(feature_importance)
        most_important_feature = X.columns[most_important_idx]
        
        # Créer un dependence plot pour la feature la plus importante
        if is_binary:
            shap_values_to_plot = shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1]
        else:
            shap_values_to_plot = shap_values
            
        shap.dependence_plot(
            most_important_idx, 
            shap_values_to_plot, 
            X_sample,
            feature_names=X.columns,
            show=False
        )
        
        plt.title(f'Dependence plot pour {most_important_feature}', fontsize=14)
        plt.tight_layout()
        plt.show()
        
        # Créer une fonction pour visualiser les contributions des features pour un exemple
        def plot_feature_contributions(example_idx, shap_values, X_sample, is_binary=False):
            # Extraire les valeurs SHAP pour l'exemple
            if is_binary:
                shap_vals = shap_values[1][example_idx] if isinstance(shap_values, list) else shap_values[example_idx, :, 1]
            else:
                shap_vals = shap_values[example_idx]
                
            # Créer un DataFrame pour la visualisation
            contrib_df = pd.DataFrame({
                'Feature': X_sample.columns,
                'Contribution': shap_vals
            }).sort_values('Contribution', ascending=False)
            
            # Visualiser
            plt.figure(figsize=(10, 6))
            bars = plt.barh(contrib_df['Feature'], contrib_df['Contribution'])
            
            # Colorer les barres en fonction de leur contribution (positive/négative)
            for i, bar in enumerate(bars):
                if contrib_df['Contribution'].iloc[i] > 0:
                    bar.set_color('#66B2FF')  # Bleu pour contributions positives
                else:
                    bar.set_color('#FF9999')  # Rouge pour contributions négatives
            
            plt.axvline(x=0, color='gray', linestyle='--')
            plt.title(f'Contributions des variables pour l\'exemple #{example_idx}', fontsize=14)
            plt.xlabel('Impact sur la prédiction')
            plt.tight_layout()
            plt.show()
            
            # Afficher les contributions numériques
            print("\nContributions des variables (top 5):")
            for _, row in contrib_df.head(5).iterrows():
                print(f"  {row['Feature']}: {row['Contribution']:.4f}")
                
            print("\nContributions des variables (bottom 5):")
            for _, row in contrib_df.tail(5).iterrows():
                print(f"  {row['Feature']}: {row['Contribution']:.4f}")
                
            # Calculer la prédiction
            if is_binary:
                expected_value = explainer.expected_value[1] if isinstance(explainer.expected_value, list) else explainer.expected_value
            else:
                expected_value = explainer.expected_value
                
            prediction = expected_value + np.sum(shap_vals)
            print(f"\nValeur de base (expected value): {expected_value:.4f}")
            print(f"Somme des contributions: {np.sum(shap_vals):.4f}")
            print(f"Prédiction finale: {prediction:.4f}")
            
            if is_binary:
                print(f"Probabilité: {1 / (1 + np.exp(-prediction)):.4f}")
        
        # Utiliser notre fonction personnalisée
        plot_feature_contributions(example_idx, shap_values, X_sample, is_binary)
            
    except Exception as e:
        print(f"Erreur lors de l'analyse de l'exemple: {e}")
    
    print("\nAnalyse de l'importance des variables terminée.")
    
    return {
        "feature_importance_rf": feature_importances,
        "coefficients_logreg": coefs,
        "shap_explainer": explainer,
        "shap_values": shap_values,
        "X_sample": X_sample
    }
In [310]:
importance_analysis = analyze_feature_importance(X, y, results)
importance_analysis
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Analyse d'un exemple spécifique (indice 27):

Valeurs de l'exemple:
  percent_college: 29.9120979309082
  semi_urban_pct: 0.4029850746268656
  median_household_income: 51771.0
  percent_poverty: 15.6
  rural_pct: 0.16417910447761197
  percent_bachelor: 25.46833229064941
<Figure size 1000x600 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
Contributions des variables (top 5):
  rural_pct: 0.0588
  median_household_income: 0.0429
  percent_bachelor: 0.0330
  semi_urban_pct: 0.0325
  percent_college: 0.0297

Contributions des variables (bottom 5):
  median_household_income: 0.0429
  percent_bachelor: 0.0330
  semi_urban_pct: 0.0325
  percent_college: 0.0297
  percent_poverty: 0.0120
Erreur lors de l'analyse de l'exemple: unsupported format string passed to numpy.ndarray.__format__

Analyse de l'importance des variables terminée.
Out[310]:
{'feature_importance_rf':                    feature  importance
 4                rural_pct    0.296621
 2  median_household_income    0.192839
 0          percent_college    0.168414
 1           semi_urban_pct    0.164082
 5         percent_bachelor    0.118620
 3          percent_poverty    0.059424,
 'coefficients_logreg':                    feature  coefficient
 0          percent_college     0.338587
 4                rural_pct     0.291271
 1           semi_urban_pct     0.283146
 3          percent_poverty     0.198093
 5         percent_bachelor    -0.249499
 2  median_household_income    -0.361014,
 'shap_explainer': <shap.explainers._tree.TreeExplainer at 0x7fdad4ea3a30>,
 'shap_values': array([[[-5.69003582e-03,  5.69003582e-03],
         [-3.70237548e-02,  3.70237548e-02],
         [-4.29828621e-02,  4.29828621e-02],
         [-1.73212954e-02,  1.73212954e-02],
         [-5.81950436e-02,  5.81950436e-02],
         [-3.76105377e-02,  3.76105377e-02]],
 
        [[-3.23404282e-02,  3.23404282e-02],
         [-3.54733745e-02,  3.54733745e-02],
         [-4.54803791e-02,  4.54803791e-02],
         [-2.04290890e-02,  2.04290890e-02],
         [-2.80710254e-02,  2.80710254e-02],
         [-3.70292332e-02,  3.70292332e-02]],
 
        [[ 1.30980957e-02, -1.30980957e-02],
         [ 1.01373954e-02, -1.01373954e-02],
         [-4.99943063e-02,  4.99943063e-02],
         [-5.31275008e-04,  5.31275008e-04],
         [-9.00251500e-02,  9.00251500e-02],
         [ 1.08491711e-01, -1.08491711e-01]],
 
        [[-5.58919284e-02,  5.58919284e-02],
         [-3.11128012e-02,  3.11128012e-02],
         [-3.87881638e-02,  3.87881638e-02],
         [-2.41225010e-03,  2.41225010e-03],
         [-5.01496435e-02,  5.01496435e-02],
         [-3.04687423e-02,  3.04687423e-02]],
 
        [[-4.34703395e-02,  4.34703395e-02],
         [-2.94851777e-02,  2.94851777e-02],
         [-4.10398173e-02,  4.10398173e-02],
         [-1.00328889e-02,  1.00328889e-02],
         [-4.95695039e-02,  4.95695039e-02],
         [-3.52258022e-02,  3.52258022e-02]],
 
        [[-4.03918529e-02,  4.03918529e-02],
         [-9.73775060e-03,  9.73775060e-03],
         [-4.12205966e-02,  4.12205966e-02],
         [-8.22771787e-03,  8.22771787e-03],
         [-5.34671897e-02,  5.34671897e-02],
         [-3.57784217e-02,  3.57784217e-02]],
 
        [[-3.23246441e-02,  3.23246441e-02],
         [-3.29164854e-02,  3.29164854e-02],
         [-4.26690692e-02,  4.26690692e-02],
         [-1.30768543e-02,  1.30768543e-02],
         [-5.48511427e-02,  5.48511427e-02],
         [-3.29853337e-02,  3.29853337e-02]],
 
        [[ 1.29090915e-01, -1.29090915e-01],
         [-6.44150689e-02,  6.44150689e-02],
         [-4.65608541e-02,  4.65608541e-02],
         [-5.05119758e-02,  5.05119758e-02],
         [-5.07835557e-02,  5.07835557e-02],
         [ 3.43570101e-02, -3.43570101e-02]],
 
        [[-3.24619072e-02,  3.24619072e-02],
         [-3.03973937e-02,  3.03973937e-02],
         [-4.28018291e-02,  4.28018291e-02],
         [-1.15030465e-02,  1.15030465e-02],
         [-5.48845311e-02,  5.48845311e-02],
         [-3.67748218e-02,  3.67748218e-02]],
 
        [[ 1.24248851e-01, -1.24248851e-01],
         [ 1.35224880e-01, -1.35224880e-01],
         [ 1.50157423e-01, -1.50157423e-01],
         [ 3.65254869e-02, -3.65254869e-02],
         [ 2.42168606e-01, -2.42168606e-01],
         [ 1.02851224e-01, -1.02851224e-01]],
 
        [[ 1.36949614e-03, -1.36949614e-03],
         [-5.22828835e-02,  5.22828835e-02],
         [-4.87709754e-02,  4.87709754e-02],
         [-1.37170739e-03,  1.37170739e-03],
         [-7.81701520e-02,  7.81701520e-02],
         [ 2.04026927e-02, -2.04026927e-02]],
 
        [[ 1.30252480e-01, -1.30252480e-01],
         [ 1.47945364e-01, -1.47945364e-01],
         [ 1.57618067e-01, -1.57618067e-01],
         [-2.13457296e-02,  2.13457296e-02],
         [ 2.31761643e-01, -2.31761643e-01],
         [ 1.14944647e-01, -1.14944647e-01]],
 
        [[-4.25333267e-02,  4.25333267e-02],
         [-3.39742395e-02,  3.39742395e-02],
         [-4.00258930e-02,  4.00258930e-02],
         [-4.38139975e-03,  4.38139975e-03],
         [-5.76275179e-02,  5.76275179e-02],
         [-3.02811526e-02,  3.02811526e-02]],
 
        [[ 1.58339954e-01, -1.58339954e-01],
         [ 9.75818029e-02, -9.75818029e-02],
         [ 1.30937814e-01, -1.30937814e-01],
         [ 1.40137791e-02, -1.40137791e-02],
         [ 2.60197755e-01, -2.60197755e-01],
         [ 1.10105366e-01, -1.10105366e-01]],
 
        [[-7.49791350e-02,  7.49791350e-02],
         [-3.83200571e-02,  3.83200571e-02],
         [ 8.96482594e-02, -8.96482594e-02],
         [-6.15822698e-03,  6.15822698e-03],
         [-7.67519150e-02,  7.67519150e-02],
         [ 7.73754529e-03, -7.73754529e-03]],
 
        [[ 4.38430883e-02, -4.38430883e-02],
         [ 9.18289826e-02, -9.18289826e-02],
         [ 2.71347826e-01, -2.71347826e-01],
         [ 1.67548636e-02, -1.67548636e-02],
         [ 3.25221134e-02, -3.25221134e-02],
         [ 7.48795970e-02, -7.48795970e-02]],
 
        [[-3.92724441e-02,  3.92724441e-02],
         [-2.98950497e-02,  2.98950497e-02],
         [-4.11336549e-02,  4.11336549e-02],
         [-1.03783352e-02,  1.03783352e-02],
         [-5.13692238e-02,  5.13692238e-02],
         [-3.67748218e-02,  3.67748218e-02]],
 
        [[-3.15281459e-02,  3.15281459e-02],
         [-3.23469056e-02,  3.23469056e-02],
         [-4.29226776e-02,  4.29226776e-02],
         [-1.35254596e-02,  1.35254596e-02],
         [-6.26290540e-02,  6.26290540e-02],
         [-2.58712866e-02,  2.58712866e-02]],
 
        [[-3.97712179e-02,  3.97712179e-02],
         [-2.59650266e-02,  2.59650266e-02],
         [-4.25257047e-02,  4.25257047e-02],
         [-4.05397639e-03,  4.05397639e-03],
         [-6.51682577e-02,  6.51682577e-02],
         [-2.13393461e-02,  2.13393461e-02]],
 
        [[ 1.19078681e-01, -1.19078681e-01],
         [-4.23660377e-02,  4.23660377e-02],
         [-4.32820707e-02,  4.32820707e-02],
         [-3.74501565e-02,  3.74501565e-02],
         [-5.78672157e-02,  5.78672157e-02],
         [-5.69367293e-02,  5.69367293e-02]],
 
        [[-4.99420666e-02,  4.99420666e-02],
         [-1.01766532e-02,  1.01766532e-02],
         [-3.88150760e-02,  3.88150760e-02],
         [-1.31048728e-04,  1.31048728e-04],
         [-5.56475135e-02,  5.56475135e-02],
         [-3.41111715e-02,  3.41111715e-02]],
 
        [[-3.19006032e-02,  3.19006032e-02],
         [-2.46983026e-02,  2.46983026e-02],
         [-4.79928095e-02,  4.79928095e-02],
         [-2.19612368e-02,  2.19612368e-02],
         [-2.47388986e-02,  2.47388986e-02],
         [-3.75316788e-02,  3.75316788e-02]],
 
        [[-5.31273647e-02,  5.31273647e-02],
         [-3.20356891e-02,  3.20356891e-02],
         [-4.18200530e-02,  4.18200530e-02],
         [-3.87507927e-03,  3.87507927e-03],
         [-7.71999604e-02,  7.71999604e-02],
         [ 9.23461708e-03, -9.23461708e-03]],
 
        [[-4.49079717e-02,  4.49079717e-02],
         [-3.15105770e-02,  3.15105770e-02],
         [-3.93178490e-02,  3.93178490e-02],
         [-1.47056493e-02,  1.47056493e-02],
         [-6.71094962e-02,  6.71094962e-02],
         [-1.12719862e-02,  1.12719862e-02]],
 
        [[-5.80460238e-02,  5.80460238e-02],
         [-4.29680437e-02,  4.29680437e-02],
         [-3.06914329e-02,  3.06914329e-02],
         [ 3.06823174e-02, -3.06823174e-02],
         [-7.85338002e-02,  7.85338002e-02],
         [ 1.07334537e-02, -1.07334537e-02]],
 
        [[-3.49605994e-02,  3.49605994e-02],
         [-3.72765775e-02,  3.72765775e-02],
         [-4.26830748e-02,  4.26830748e-02],
         [-1.20440347e-02,  1.20440347e-02],
         [-5.16592915e-02,  5.16592915e-02],
         [-3.01999514e-02,  3.01999514e-02]],
 
        [[-3.86336028e-02,  3.86336028e-02],
         [-2.75128896e-02,  2.75128896e-02],
         [ 2.15482691e-01, -2.15482691e-01],
         [ 7.66952686e-02, -7.66952686e-02],
         [ 2.98277355e-01, -2.98277355e-01],
         [ 4.68676491e-02, -4.68676491e-02]],
 
        [[-2.96763224e-02,  2.96763224e-02],
         [-3.24664485e-02,  3.24664485e-02],
         [-4.29086720e-02,  4.29086720e-02],
         [-1.19975833e-02,  1.19975833e-02],
         [-5.87891694e-02,  5.87891694e-02],
         [-3.29853337e-02,  3.29853337e-02]],
 
        [[ 2.62401617e-01, -2.62401617e-01],
         [ 1.53281412e-02, -1.53281412e-02],
         [-1.13783404e-02,  1.13783404e-02],
         [ 7.05713637e-02, -7.05713637e-02],
         [-4.41577108e-02,  4.41577108e-02],
         [ 1.48411400e-01, -1.48411400e-01]],
 
        [[-5.70271412e-02,  5.70271412e-02],
         [-2.36582961e-02,  2.36582961e-02],
         [-4.15038688e-02,  4.15038688e-02],
         [-3.48109108e-03,  3.48109108e-03],
         [-6.44714400e-02,  6.44714400e-02],
         [-1.86816922e-02,  1.86816922e-02]],
 
        [[-6.36930468e-02,  6.36930468e-02],
         [ 9.50493382e-02, -9.50493382e-02],
         [-4.77279310e-02,  4.77279310e-02],
         [-5.63373806e-03,  5.63373806e-03],
         [-6.80438124e-02,  6.80438124e-02],
         [-2.87743393e-02,  2.87743393e-02]],
 
        [[-5.34747535e-02,  5.34747535e-02],
         [-3.17162180e-02,  3.17162180e-02],
         [-3.76145233e-02,  3.76145233e-02],
         [ 1.83455960e-02, -1.83455960e-02],
         [-7.89799211e-02,  7.89799211e-02],
         [ 8.46162906e-02, -8.46162906e-02]],
 
        [[ 3.68502375e-02, -3.68502375e-02],
         [-1.10728352e-02,  1.10728352e-02],
         [ 1.20929911e-01, -1.20929911e-01],
         [ 8.35737736e-02, -8.35737736e-02],
         [ 2.89126620e-01, -2.89126620e-01],
         [ 1.11768763e-01, -1.11768763e-01]],
 
        [[-5.80337427e-02,  5.80337427e-02],
         [-3.51713246e-02,  3.51713246e-02],
         [-3.96986603e-02,  3.96986603e-02],
         [-1.96845813e-03,  1.96845813e-03],
         [-7.37627538e-02,  7.37627538e-02],
         [-1.88589976e-04,  1.88589976e-04]],
 
        [[-5.84697061e-02,  5.84697061e-02],
         [-2.33043764e-02,  2.33043764e-02],
         [-7.74383211e-03,  7.74383211e-03],
         [ 3.89604942e-03, -3.89604942e-03],
         [-5.81764540e-02,  5.81764540e-02],
         [-2.50252101e-02,  2.50252101e-02]],
 
        [[ 1.45945307e-01, -1.45945307e-01],
         [ 8.97559251e-02, -8.97559251e-02],
         [ 1.75780297e-01, -1.75780297e-01],
         [ 2.42293696e-02, -2.42293696e-02],
         [ 2.40123137e-01, -2.40123137e-01],
         [ 1.05342435e-01, -1.05342435e-01]],
 
        [[-6.92980370e-02,  6.92980370e-02],
         [-6.00726929e-02,  6.00726929e-02],
         [-4.93514759e-02,  4.93514759e-02],
         [-2.72276985e-02,  2.72276985e-02],
         [ 2.42441525e-01, -2.42441525e-01],
         [-4.53151498e-02,  4.53151498e-02]],
 
        [[ 1.54544597e-01, -1.54544597e-01],
         [ 2.20545756e-01, -2.20545756e-01],
         [ 2.12440261e-03, -2.12440261e-03],
         [ 1.75522539e-02, -1.75522539e-02],
         [ 3.13186693e-01, -3.13186693e-01],
         [ 5.32227679e-02, -5.32227679e-02]],
 
        [[-3.47047062e-02,  3.47047062e-02],
         [-3.35509800e-02,  3.35509800e-02],
         [-4.53754719e-02,  4.53754719e-02],
         [-2.15031797e-02,  2.15031797e-02],
         [-2.45371369e-02,  2.45371369e-02],
         [-3.91520545e-02,  3.91520545e-02]],
 
        [[-5.37958443e-02,  5.37958443e-02],
         [-4.51532297e-02,  4.51532297e-02],
         [-3.72407464e-02,  3.72407464e-02],
         [ 3.05808276e-02, -3.05808276e-02],
         [-7.80999511e-02,  7.80999511e-02],
         [ 4.88541453e-03, -4.88541453e-03]],
 
        [[-4.47443764e-02,  4.47443764e-02],
         [ 2.62466008e-02, -2.62466008e-02],
         [-4.87473783e-02,  4.87473783e-02],
         [ 9.10437273e-04, -9.10437273e-04],
         [-7.02885080e-02,  7.02885080e-02],
         [-3.22003047e-02,  3.22003047e-02]],
 
        [[-3.07755956e-03,  3.07755956e-03],
         [-3.54187408e-02,  3.54187408e-02],
         [-4.72570700e-02,  4.72570700e-02],
         [-1.68429860e-02,  1.68429860e-02],
         [-7.29453002e-02,  7.29453002e-02],
         [-3.32818729e-02,  3.32818729e-02]],
 
        [[-4.64749023e-02,  4.64749023e-02],
         [-3.18401334e-02,  3.18401334e-02],
         [-3.92009498e-02,  3.92009498e-02],
         [-1.09952501e-02,  1.09952501e-02],
         [-5.01907736e-02,  5.01907736e-02],
         [-3.01215201e-02,  3.01215201e-02]],
 
        [[ 1.35865025e-02, -1.35865025e-02],
         [-3.36609858e-02,  3.36609858e-02],
         [-4.74008343e-02,  4.74008343e-02],
         [-2.21849526e-02,  2.21849526e-02],
         [-3.84845992e-02,  3.84845992e-02],
         [-5.06786600e-02,  5.06786600e-02]],
 
        [[-5.36768690e-02,  5.36768690e-02],
         [ 5.36719526e-03, -5.36719526e-03],
         [-3.83688579e-02,  3.83688579e-02],
         [-4.37552126e-03,  4.37552126e-03],
         [-5.43662371e-02,  5.43662371e-02],
         [-3.34032395e-02,  3.34032395e-02]],
 
        [[ 1.38013819e-01, -1.38013819e-01],
         [ 8.42321051e-02, -8.42321051e-02],
         [ 1.74314743e-01, -1.74314743e-01],
         [ 4.19096688e-02, -4.19096688e-02],
         [ 2.37474833e-01, -2.37474833e-01],
         [ 1.05231302e-01, -1.05231302e-01]],
 
        [[ 9.55114714e-02, -9.55114714e-02],
         [ 2.21973319e-01, -2.21973319e-01],
         [-8.10532836e-04,  8.10532836e-04],
         [ 1.28095656e-02, -1.28095656e-02],
         [ 3.25826099e-01, -3.25826099e-01],
         [ 1.58665484e-02, -1.58665484e-02]],
 
        [[-2.97952034e-03,  2.97952034e-03],
         [-3.49266773e-02,  3.49266773e-02],
         [-4.72570700e-02,  4.72570700e-02],
         [-1.52743585e-02,  1.52743585e-02],
         [-6.92557577e-02,  6.92557577e-02],
         [-3.91301456e-02,  3.91301456e-02]],
 
        [[-8.76757444e-03,  8.76757444e-03],
         [-3.96405330e-02,  3.96405330e-02],
         [-4.50008400e-02,  4.50008400e-02],
         [-9.40491662e-03,  9.40491662e-03],
         [-9.68369103e-03,  9.68369103e-03],
         [-4.63259744e-02,  4.63259744e-02]],
 
        [[-4.90025524e-02,  4.90025524e-02],
         [-3.22051823e-02,  3.22051823e-02],
         [-3.89613470e-02,  3.89613470e-02],
         [-1.10132810e-02,  1.10132810e-02],
         [-4.62538721e-02,  4.62538721e-02],
         [-3.13872945e-02,  3.13872945e-02]],
 
        [[ 1.29055622e-01, -1.29055622e-01],
         [-5.02155873e-02,  5.02155873e-02],
         [-4.21503139e-02,  4.21503139e-02],
         [-3.70542884e-02,  3.70542884e-02],
         [-4.81081558e-02,  4.81081558e-02],
         [-5.03508055e-02,  5.03508055e-02]]]),
 'X_sample':     percent_college  semi_urban_pct  median_household_income  percent_poverty  \
 43        28.832720        0.484252                    64044             13.6   
 40        30.288157        0.413043                    56360             13.9   
 46        26.959316        0.240602                    76471              9.9   
 12        35.848404        0.500000                    60830             11.0   
 24        32.042648        0.536585                    45928             19.5   
 31        31.799431        0.606061                    52021             17.5   
 17        29.152100        0.408333                    52256             16.0   
 32        24.268467        0.370968                    72038             13.1   
 3         29.507084        0.560000                    49020             16.0   
 30        22.916826        0.000000                    85786              9.1   
 13        28.604910        0.509804                    69212             11.4   
 8         15.547361        0.000000                    90395             14.1   
 49        31.452776        0.458333                    64177             10.4   
 6         24.491169        0.125000                    78920              9.9   
 47        33.307690        0.333333                    78674              9.8   
 4         28.893970        0.293103                    80423             11.8   
 36        31.172707        0.558442                    54447             15.1   
 33        30.872301        0.380000                    57388             13.6   
 19        29.324507        0.562500                    58824             10.9   
 48        25.967607        0.418182                    48659             16.2   
 15        32.542377        0.585859                    61807             11.0   
 9         29.736067        0.313433                    59198             12.7   
 16        31.660761        0.419048                    62028             11.3   
 26        32.688599        0.392857                    57248             12.6   
 44        35.426731        0.482759                    75705              8.8   
 25        30.086369        0.443478                    57375             12.9   
 11        31.638309        0.400000                    83734              9.0   
 0         29.912098        0.402985                    51771             15.6   
 45        25.852291        0.571429                    63293             10.1   
 27        33.386356        0.311828                    63290              9.9   
 34        36.171799        0.188679                    67402             10.5   
 5         29.465919        0.421875                    77104              9.4   
 29        28.643383        0.700000                    78571              7.5   
 37        34.312252        0.500000                    66955             11.5   
 1         35.292122        0.310345                    77203             10.2   
 21        23.049395        0.214286                    85700              9.5   
 2         33.813610        0.466667                    62027             13.5   
 39        26.346718        0.000000                    70383             11.6   
 35        29.063951        0.545455                    58704             13.0   
 23        32.411053        0.471264                    74529              8.9   
 41        32.668640        0.242424                    60414             11.9   
 10        28.107138        0.396226                    61950             13.5   
 22        32.799744        0.518072                    59522             12.9   
 18        27.149839        0.375000                    51108             18.8   
 50        36.730377        0.739130                    66152              9.9   
 20        25.420776        0.208333                    86644              9.1   
 7         26.731159        0.000000                    70348             11.2   
 42        28.035902        0.389474                    56047             13.8   
 14        28.955311        0.467391                    57617             11.9   
 28        33.873901        0.529412                    63268             12.7   
 38        24.395906        0.388060                    63455             12.0   
 
     rural_pct  percent_bachelor  
 43   0.192913         29.896368  
 40   0.021739         28.119732  
 46   0.157895         38.781448  
 12   0.227273         27.568047  
 24   0.256098         22.025591  
 31   0.181818         27.328411  
 17   0.300000         24.215286  
 32   0.016129         36.574459  
 3    0.173333         23.027790  
 30   0.000000         39.713642  
 13   0.098039         34.652561  
 8    0.000000         58.540707  
 49   0.180556         30.116730  
 6    0.000000         39.284241  
 47   0.128205         36.020138  
 4    0.068966         33.925964  
 36   0.207792         25.520313  
 33   0.160000         31.255686  
 19   0.125000         31.812195  
 48   0.200000         20.614605  
 15   0.202020         28.566288  
 9    0.029851         29.879576  
 16   0.400000         33.384693  
 26   0.517857         32.029671  
 44   0.172414         34.017944  
 25   0.260870         29.218016  
 11   0.000000         32.977276  
 0    0.164179         25.468332  
 45   0.214286         38.024567  
 27   0.548387         31.911806  
 34   0.698113         30.047791  
 5    0.312500         40.912342  
 29   0.000000         37.042934  
 37   0.138889         33.664604  
 1    0.586207         29.551214  
 21   0.000000         43.688908  
 2    0.000000         29.466806  
 39   0.000000         34.199402  
 35   0.022727         28.277195  
 23   0.218391         36.081844  
 41   0.636364         28.840324  
 10   0.138365         31.323006  
 22   0.168675         29.135920  
 18   0.078125         24.113539  
 50   0.173913         27.362514  
 20   0.000000         40.172043  
 7    0.000000         31.993366  
 42   0.168421         27.338949  
 14   0.054348         26.456558  
 28   0.235294         24.731855  
 38   0.059701         31.429663  }

7. Exports vers le format HTML¶

In [243]:
!jupyter nbconvert notebook_states.ipynb --to html
[NbConvertApp] Converting notebook notebook_states.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 28 image(s).
[NbConvertApp] Writing 49395640 bytes to notebook_states.html